1. Introduction



For readers not familiar with concentration inequalities, we recommend Ledoux (2001), 
Mitzenmacher and Upfal (2005) and Dubhashi and Panconesi (2009). 
The transportation cost inequality method for proving measure concentration was initiated
by Katalin Marton. Marton (1986) proves the blow-up lemma (a weaker version of measure
concentration in Hamming distance).

Marton (1996a) proves measure concentration in Hamming distance for countable state
Markov chains. For a homogeneous Markov chain with state space $\Omega$, and transition
probabilities $P_{i,j}$, let us denote

$$a := \max_{i,j} d_{TV}(P_{i,\cdot}, P_{j,\cdot}), \tag{1.1}$$

then Proposition 1 of Marton (1996a) proves that measure concentration holds with constants
$1/(1-a)^2$ times worse than in the independent case.

Marton (1996b) and Marton (1997) extend this result, and prove Talagrand's inequality for
Markov chains, with constants $1/(1-a)^2$ times worse than in the independent case.

The Markov chain setting was further generalized to a class of random processes, for 
Hamming distance, in Marton (1998a). 

Talagrand's convex distance inequality, for a larger class of random processes, was in- 
dependently proven in Marton (1998b) and Samson (2000). The latter also proves a weak 
version of Talagrand's suprema of empirical processes inequality. 

In Marton (2003), these results are further extended to prove concentration inequalities 
for a larger class of functions. 

Chazottes et al. (2007) (using an elementary martingale-type argument) and Kontorovich 
(2007) (using martingales and linear algebraic inequalities) prove concentration inequalities, 
in Hamming distance, for a class of mixing coefficients, similar to those of Samson (2000). 
For homogeneous countable state Markov chains, their results are the same as Proposition 
1 of Marton (1996a). 

Lezaud (1998a) proves Bernstein-type concentration inequalities for finite state Markov
chains, for empirical means $n^{-1}\sum_{i=1}^{n} f(X_i)$ (and $t^{-1}\int_{s=0}^{t} f(X_s)\,ds$ in the continuous case). For
reversible Markov chains, the constants in the exponents depend on the spectral gap of the 
chain. Lezaud (2001) generalizes this to Markov processes with general state space, and 
proves a Berry-Esseen bound for the empirical mean. 

The purpose of this paper is to improve these results, and to show that concentration inequalities
for Markov chains are in fact governed by the mixing time of the chain. This work grew out
of the author's attempt to solve the "Spectral transportation cost inequality" conjecture
stated in Section 6.4 of Kontorovich (2007).

1.1. Main definitions 

In the following, we will consider dependent random variables $X = (X_1, \ldots, X_N)$ taking
values in some set

$$\Lambda := \Lambda_1 \times \ldots \times \Lambda_N,$$

and let $P$ denote the law of $X$, i.e. $X \sim P$. Let $Y := (Y_1, \ldots, Y_N)$ be a random vector
taking values in $\Lambda$, and suppose $Y \sim Q$. Denote $[N] := \{1, \ldots, N\}$.

Assumption 1.1. For notational convenience, we will suppose that $\Lambda$ is discrete. The
continuous case can be treated similarly, as is done in Samson (2000).
We will denote a coupling of $P$ and $Q$ by

$$\pi[X \sim P, Y \sim Q]. \tag{1.2}$$

An example: let $\pi[X \sim P, Y \sim Q]$ be the maximal coupling of $P$ and $Q$ (see Lindvall
(1992)), then

$$\pi[X \ne Y] = d_{TV}(P, Q).$$

We will need to refer to subsets of our vectors; let

$$X_{\le k} := (X_1, \ldots, X_k), \qquad X_{\ge k} := (X_k, \ldots, X_N). \tag{1.3}$$

The laws of these "subvectors" will be denoted by $P_{\le k}$ and $P_{\ge k}$, respectively.

The following is the most important definition of this paper. It has appeared in Marton 
(2003). 

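Definition 1 (Marton coupling). Let $X := (X_1, \ldots, X_N)$ be a vector of random variables
taking values in $\Omega = \Omega_1 \times \ldots \times \Omega_N$, with law $P$. We define a Marton coupling for $X$ as a
set of couplings

$$\mathcal{M} := \left\{ \mathcal{M}^i\left[ X_{\ge i+1} \sim P_{\ge i+1}(\cdot \mid x_{\le i}),\ X'_{\ge i+1} \sim P_{\ge i+1}(\cdot \mid x_{\le i-1}, x'_i) \right] \right\},$$

i.e. for each $i \le N$ and $(x_{\le i}, x'_i)$, $\mathcal{M}^i := \mathcal{M}^i(\cdot \mid x_{\le i}, x'_i)$ is a coupling between
$X_{\ge i+1} \sim P_{\ge i+1}(\cdot \mid x_{\le i})$ and $X'_{\ge i+1} \sim P_{\ge i+1}(\cdot \mid x_{\le i-1}, x'_i)$, satisfying the following condition:

$$\text{for every } (x_{\le i}, x'_i) \text{ with } x_i = x'_i, \quad \mathcal{M}^i\left[X_{\ge i+1} = X'_{\ge i+1} \mid x_{\le i}, x'_i\right] = 1. \tag{1.4}$$

We define the mixing matrix of $\mathcal{M}$, $\Gamma := (\Gamma_{i,j})_{i,j \le N}$, as an upper diagonal matrix with
$\Gamma_{i,i} := 1$ for $i \le N$, and $\Gamma_{i,j} := \sup_{x_{\le i}, x'_i} \mathcal{M}^i\left[X_j \ne X'_j \mid x_{\le i}, x'_i\right]$ for $1 \le i < j \le N$.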



Remark 1.1. Samson (2000), Chazottes et al. (2007), Chazottes and Redig (2009), Kontorovich
(2007) use a similar construction, but assume that $\mathcal{M}^i$ is the maximal coupling between

$$P_{\ge i+1}(\cdot \mid x_{\le i}) \quad \text{and} \quad P_{\ge i+1}(\cdot \mid x_{\le i-1}, x'_i)$$

(the coupling that "achieves" the total variation distance, see Definition 8, or Lindvall
(1992)). We will use the extra freedom provided by Definition 1 for our theorems in this
paper.

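For homogeneous Markov chains, and $\mathcal{M}^i$ defined as the maximal coupling, we get

$$\Gamma = (\Gamma_{i,j})_{i,j \le N} \le \begin{pmatrix} 1 & a & a^2 & a^3 & \cdots \\ 0 & 1 & a & a^2 & \cdots \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}, \tag{1.5}$$

which gives $\|\Gamma\| \le 1/(1-a)$ ($a$ is defined as in (1.1)).
Caveat lector: although it is true that

$$\Gamma_{i,j} \ge \max_{x_{\le i}, x'_i} d_{TV}\left(P_j(\cdot \mid x_{\le i}), P_j(\cdot \mid x_{\le i-1}, x'_i)\right),$$

equality does not hold in general.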
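As a numerical illustration of the geometric bound on the mixing matrix, the following sketch computes $a$ from (1.1) for a small transition matrix, fills an upper triangular matrix with the entries $a^{j-i}$, and bounds its operator norm by the square root of (maximal row sum) times (maximal column sum), which never exceeds $1/(1-a)$. The function names and the two-state example matrix are assumptions made for this illustration only.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def marton_a(P):
    """a := max over pairs of rows of the total variation distance, as in (1.1)."""
    return max(tv_distance(r, s) for r in P for s in P)


def geometric_bound_matrix(a, N):
    """Upper triangular N x N matrix with entry a**(j - i) for j >= i, zero below."""
    return [[a ** (j - i) if j >= i else 0.0 for j in range(N)] for i in range(N)]


def norm_upper_bound(G):
    """sqrt(max row sum * max column sum), an upper bound on the spectral norm."""
    n_rows, n_cols = len(G), len(G[0])
    max_row = max(sum(abs(v) for v in row) for row in G)
    max_col = max(sum(abs(G[i][j]) for i in range(n_rows)) for j in range(n_cols))
    return (max_row * max_col) ** 0.5


# Hypothetical two-state transition matrix (rows sum to one).
P = [[0.6071, 0.3929], [0.3443, 0.6557]]
a = marton_a(P)
G = geometric_bound_matrix(a, 50)
print(a, norm_upper_bound(G), 1.0 / (1.0 - a))
```

The row-sum/column-sum bound is used here instead of an exact singular value computation to keep the sketch free of external libraries.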
We will use the definition of partition of a set:

Definition 2 (Partition). A partition of a set $S$ is the division of $S$ into disjoint non-empty
subsets that together cover $S$. Analogously, we say that $\hat{X} = (\hat{X}_1, \ldots, \hat{X}_n)$ is a partition of a
set of random variables $X := (X_1, \ldots, X_N)$ if $(\hat{X}_i)_{1 \le i \le n}$ are disjoint, and together cover $X$.
For a partition $\hat{X}$ of $X$, we denote the number of elements of $\hat{X}_i$ by $s(\hat{X}_i)$ (size of $\hat{X}_i$), and
call $s(\hat{X}) := \max_{1 \le i \le n} s(\hat{X}_i)$ the size of the partition.

Finally, we denote the set of indices of the elements of $\hat{X}_i$ by $\mathcal{I}(\hat{X}_i)$, i.e. $X_j \in \hat{X}_i$ if and
only if $j \in \mathcal{I}(\hat{X}_i)$. For a set of indices $S \subset [N]$, let $X_S := \{X_j : j \in S\}$. In particular,

$$\hat{X}_i = X_{\mathcal{I}(\hat{X}_i)}.$$

The state space of $X$ will be denoted by $\Lambda := \Lambda_1 \times \ldots \times \Lambda_N$ and the state space of $\hat{X}$ is
denoted by $\hat{\Lambda} := \hat{\Lambda}_1 \times \ldots \times \hat{\Lambda}_n$, with $\hat{\Lambda}_i = \Lambda_{\mathcal{I}(\hat{X}_i)}$.

In Sections 4.5 and 4.6 of Levin, Peres and Wilmer (2009), the mixing time of a time
homogeneous chain is defined in the following way:

Definition 3. Let $X_1, X_2, X_3, \ldots$ be a countable state, time homogeneous Markov chain with
transition matrix $P$, state space $\Omega$, and stationary distribution $\pi$.
Let us denote

$$d(t) := \sup_{x \in \Omega} d_{TV}\left(P^t(x, \cdot), \pi\right), \qquad t_{mix}(\epsilon) := \min\{t : d(t) \le \epsilon\},$$

and

$$t_{mix} := t_{mix}(1/4).$$

We will use the following alternative definition, which also works for time inhomogeneous
Markov chains:

Definition 4. Let $X_1, \ldots, X_N$ be a countable state Markov chain with state space $\Omega_1 \times \ldots \times \Omega_N$
(i.e. $X_i \in \Omega_i$). Let us denote by $\tau(\epsilon)$ the minimal $t$ such that $P_{i+t}(\cdot \mid X_i = x)$ and $P_{i+t}(\cdot \mid X_i = y)$
are at most $\epsilon$ away in total variation distance for every $1 \le i \le N - t$ and $x, y \in \Omega_i$,
i.e. for $0 < \epsilon < 1$,

$$\tau(\epsilon) := \min\left\{ t \in \mathbb{N} : \max_{1 \le i \le N-t}\ \sup_{x, y \in \Omega_i} d_{TV}\left(P_{i+t}(\cdot \mid X_i = x), P_{i+t}(\cdot \mid X_i = y)\right) \le \epsilon \right\}.$$



Remark 1.2. One can easily see that in the case of homogeneous Markov chains, by the triangle
inequality, one has

$$\tau(2\epsilon) \le t_{mix}(\epsilon) \le \tau(\epsilon).$$
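The two notions of mixing time can be compared numerically on a small example. The sketch below computes $t_{mix}(\epsilon)$ (distance of $P^t(x, \cdot)$ from the stationary distribution) and $\tau(\epsilon)$ (distance between $P^t(x, \cdot)$ and $P^t(y, \cdot)$) for a time homogeneous finite chain by repeated matrix multiplication. The function names, the restriction to $t \ge 1$, and the two-state example are assumptions of this illustration, not part of the definitions above.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def t_mix(P, pi, eps, t_max=10000):
    """Smallest t >= 1 with max_x d_TV(P^t(x, .), pi) <= eps."""
    Pt = P
    for t in range(1, t_max + 1):
        if max(tv_distance(row, pi) for row in Pt) <= eps:
            return t
        Pt = mat_mul(Pt, P)
    raise ValueError("chain did not mix within t_max steps")


def tau(P, eps, t_max=10000):
    """Smallest t >= 1 with max_{x,y} d_TV(P^t(x, .), P^t(y, .)) <= eps."""
    Pt = P
    for t in range(1, t_max + 1):
        if max(tv_distance(r, s) for r in Pt for s in Pt) <= eps:
            return t
        Pt = mat_mul(Pt, P)
    raise ValueError("chain did not mix within t_max steps")


# Hypothetical two-state chain with its stationary distribution.
P = [[0.6071, 0.3929], [0.3443, 0.6557]]
pi = [0.3443 / 0.7372, 0.3929 / 0.7372]
eps = 0.05
print(tau(P, 2 * eps), t_mix(P, pi, eps), tau(P, eps))
```

For this example the three printed values are nondecreasing, in line with the sandwich inequality of Remark 1.2.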

In the following, based on Section 20 of Levin, Peres and Wilmer (2009), we briefly review 
some definitions about finite (or countable) state Markov chains with continuous time. 

Let $(\Phi_k)_{k=0}^{\infty}$ be a time homogeneous Markov chain with transition matrix $P$, and countable
state space $\Omega$. Let $(T_i)_{i=1}^{\infty} \sim \exp(1)$ be i.i.d. exponentially distributed random variables
independent of $(\Phi_k)_{k=0}^{\infty}$. Let $S_k = \sum_{i=1}^{k} T_i$ for $k \ge 1$, and define

$$X_t := \Phi_k \quad \text{for } S_k \le t < S_{k+1}.$$
The heat kernel $H_t$ is defined as

$$H_t(x, y) := \mathbb{P}(X_t = y \mid X_0 = x). \tag{1.6}$$

One can show that, in matrix form, for finite state space $\Omega$, we can express $H_t$ with the
matrix exponential:

$$H_t = \exp(t(P - I)), \tag{1.7}$$

here $I$ denotes the $|\Omega| \times |\Omega|$ identity matrix.
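Since the number of jumps of the continuous time chain up to time $t$ is Poisson distributed with mean $t$, the matrix exponential in (1.7) can be evaluated as the series $\sum_{k \ge 0} e^{-t} \frac{t^k}{k!} P^k$. The following sketch truncates this series; the function names, the truncation level `terms`, and the two-state example are assumptions of this illustration.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def heat_kernel(P, t, terms=200):
    """Approximate H_t = exp(t(P - I)) by the first `terms` terms of the Poisson series."""
    n = len(P)
    Pk = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0
    H = [[0.0] * n for _ in range(n)]
    weight = math.exp(-t)  # probability of zero jumps up to time t
    for k in range(terms):
        for i in range(n):
            for j in range(n):
                H[i][j] += weight * Pk[i][j]
        Pk = mat_mul(Pk, P)
        weight *= t / (k + 1)  # Poisson(t) probability of k + 1 jumps
    return H


# Hypothetical two-state chain; p01 + p10 = 0.7372.
P = [[0.6071, 0.3929], [0.3443, 0.6557]]
H = heat_kernel(P, 1.5)
print(H)
```

For a two-state chain the result can be compared with the closed form $H_t(0,0) = \pi_0 + \pi_1 e^{-t(p_{01} + p_{10})}$, which is a convenient sanity check on the truncation.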

Theorem 20.1 of Levin, Peres and Wilmer (2009) proves that for irreducible $P$, there exists
a stationary distribution $\pi$ for $X_t$. We define the mixing time of continuous time, time
homogeneous chains as in (20.7) of Levin, Peres and Wilmer (2009):

Definition 5.

$$t_{mix}^{cont}(\epsilon) := \inf\left\{ t \ge 0 : \sup_{x \in \Omega} d_{TV}(H_t(x, \cdot), \pi) \le \epsilon \right\}, \tag{1.8}$$

and denote $t_{mix}^{cont} := t_{mix}^{cont}(1/4)$.

For time inhomogeneous, continuous time chains with countable state space $\Omega$, we denote

$$H_{t_1, t_2}(x, y) := \mathbb{P}(X_{t_2} = y \mid X_{t_1} = x), \tag{1.9}$$

and define the mixing time analogously to Definition 4:
Definition 6.

$$\tau^{cont}(\epsilon) := \min\left\{ t \ge 0 : \sup_{s}\ \sup_{x, y \in \Omega} d_{TV}\left[H_{s, s+t}(x, \cdot), H_{s, s+t}(y, \cdot)\right] \le \epsilon \right\}, \tag{1.10}$$

and denote $\tau^{cont} := \tau^{cont}(1/4)$.

For finite state, reversible, aperiodic, irreducible chains, in discrete time, write the
eigenvalues of the transition matrix $P$ as

$$1 = \lambda_1 > \lambda_2 \ge \ldots \ge \lambda_{|\Omega|} > -1.$$

Denote

$$\lambda^* := \max\{|\lambda| : \lambda \text{ is an eigenvalue of } P,\ \lambda \ne 1\}, \qquad \gamma^* := 1 - \lambda^*, \qquad \gamma := 1 - \lambda_2.$$

We call $\gamma^*$ the absolute spectral gap, and $\gamma$ the spectral gap. Obviously, $\gamma \ge \gamma^*$. The relation
between the mixing time and the spectral gap is given by the following proposition:

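Proposition 1.1. For reversible, irreducible, aperiodic chains in discrete time with finite
state space $\Omega$, we have

$$t_{mix}(\epsilon) \ge \left(\frac{1}{\gamma^*} - 1\right) \log\left(\frac{1}{2\epsilon}\right) \ge \left(\frac{1}{\gamma} - 1\right) \log\left(\frac{1}{2\epsilon}\right), \tag{1.11}$$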



tmixi^) — 



1 , / 

log 



7* \ e 



;i.i2) 
For continuous time, time homogeneous chains with reversible, irreducible $P$, and finite state
space $\Omega$, similar results hold:

C»)>(^-l)lo g (i)>(i-l).og(i), (1.13) 



^mix ( e ) — 



1 , / \/M 

— log 

7* 



1.14) 



Proof. (1.11) follows by Theorem 12.4 of Levin, Peres and Wilmer (2009). (1.12) is proven,
for example, in Chawla (2010). (1.13) and (1.14) are left to the reader as an exercise. □

1.2. Additional definitions 

In this section we introduce some additional notation that will be used in the statements of
some of our theorems. For a (not necessarily time homogeneous) Markov chain $X_1, \ldots, X_N$,
denote

$$\tau_{\min} := \inf_{0 < \epsilon < 1} \tau(\epsilon)/(1 - \epsilon)^2, \tag{1.15}$$
$$\tilde{\tau}_{\min} := \inf_{0 < \epsilon < 1} \tau(\epsilon)/(1 - \sqrt{\epsilon})^2. \tag{1.16}$$

For time homogeneous chains, for some integer t > 0, denote 

Vmin (to):= inf eU&J-^l (1.17) 

0<e<l l — e 

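In the time continuous case, we define $\tau_{\min}^{cont}$, $\tilde{\tau}_{\min}^{cont}$, and $\eta_{\min}^{cont}(t_0)$ analogously. The following
proposition gives some estimates on these quantities:

Proposition 1.2. For time homogeneous chains, the following inequalities hold:

$$\tau_{\min} \le \inf_{0 < \epsilon < 1} t_{mix}(\epsilon/2)/(1 - \epsilon)^2 \le \frac{128}{49}\, t_{mix} \le 2.62\, t_{mix}, \tag{1.18}$$

$$\tilde{\tau}_{\min} \le \inf_{0 < \epsilon < 1} t_{mix}(\epsilon/2)/(1 - \sqrt{\epsilon})^2 \le 4.43\, t_{mix}, \tag{1.19}$$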

^min(io) < 4~l.wl • it mix . (1.20) 

The same inequalities hold in the continuous case, with $\tau_{\min}$ replaced by $\tau_{\min}^{cont}$, $\eta_{\min}$ replaced
by $\eta_{\min}^{cont}$, and $t_{mix}$ replaced by $t_{mix}^{cont}$.

Remark 1.3. In many cases, the Markov chain exhibits a cutoff, i.e. the total variation
distance decreases very rapidly in a small interval, see Figure 1 of Lubetzky and Sly (2009).
If this happens, then $\tau_{\min} \approx \tilde{\tau}_{\min} \approx t_{mix}$, and $\eta_{\min}(t_0)$ decreases very quickly for $t_0 > t_{mix}$.

Proof. The first inequality in (1.18) follows by the triangle inequality, the second one by taking
$\epsilon = 1/8$, and noticing that $t_{mix}(1/16) \le 2 t_{mix}$. The other inequalities are similar. □
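The numerical constants of Proposition 1.2 can be checked with a few lines of arithmetic. The first constant uses the choice $\epsilon = 1/8$ from the proof. For the second constant, the choice $\epsilon = 1/32$ together with $t_{mix}(1/64) \le 3 t_{mix}$ is an assumption of this sketch (it is not stated in the text), chosen because it reproduces the value 4.43.

```python
import math

# (1.18): eps = 1/8 and t_mix(1/16) <= 2 * t_mix give 2 / (1 - 1/8)^2 = 128/49.
c_118 = 2 / (1 - 1 / 8) ** 2

# (1.19): assumed choice eps = 1/32 with t_mix(1/64) <= 3 * t_mix,
# giving 3 / (1 - sqrt(1/32))^2.
c_119 = 3 / (1 - math.sqrt(1 / 32)) ** 2

print(c_118, 128 / 49, c_119)
```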
2. Results

2.1. Results by Marton couplings 

For our first result, we need to define a distance: 

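Definition. For $c \in \mathbb{R}^N_+$, $x, y \in \Omega_1 \times \ldots \times \Omega_N$, we say that the $c$ weighted Hamming
distance of $x$ and $y$ is

$$d_c(x, y) := \sum_{i=1}^{N} c_i \mathbb{1}[x_i \ne y_i], \tag{2.1}$$

and the $c$ weighted Hamming distance of two measures, $P$ and $Q$ on $\Omega$ is

$$d_c(P, Q) := \inf_{\pi(X \sim P, Y \sim Q)} \sum_{i=1}^{N} c_i\, \pi[X_i \ne Y_i]. \tag{2.2}$$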
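The pointwise distance in (2.1) is a weighted count of the coordinates where two configurations differ. A minimal sketch (the function name and the sample inputs are chosen for illustration only):

```python
def weighted_hamming(c, x, y):
    """c-weighted Hamming distance: sum of c[i] over coordinates with x[i] != y[i]."""
    if not (len(c) == len(x) == len(y)):
        raise ValueError("c, x and y must have the same length")
    return sum(ci for ci, xi, yi in zip(c, x, y) if xi != yi)


# Only the third coordinate differs, so the distance equals its weight.
print(weighted_hamming([1, 2, 3], "abc", "abd"))
```

With all weights equal to one this reduces to the usual Hamming distance.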

The relative entropy of $P$ and $Q$ will be denoted by

$$D(Q \| P) := \sum_{x} Q(x) \log\left(\frac{Q(x)}{P(x)}\right). \tag{2.3}$$

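Theorem 2.1. Let $X = (X_1, \ldots, X_N)$ be a sequence of random variables, $X \in \Lambda$, $X \sim P$.
Let $\hat{X} = (\hat{X}_1, \ldots, \hat{X}_n)$ be a partition of this sequence, $\hat{X} \in \hat{\Lambda}$, $\hat{X} \sim \hat{P}$. Suppose that we
have a Marton coupling for $\hat{X}$ with mixing matrix $\Gamma$. Then for any distribution $Q$ on $\Lambda$, any
$c \in \mathbb{R}^N_+$, we have

$$d_c(Q, P) \le \|\Gamma \cdot C(c)\| \sqrt{\tfrac{1}{2} D(Q \| P)}, \tag{2.4}$$

with $C(c) \in \mathbb{R}^n_+$ defined as

$$C_i(c) := \sum_{j \in \mathcal{I}(\hat{X}_i)} c_j \quad \text{for } i \le n. \tag{2.5}$$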

Corollary 2.1 (McDiarmid's bounded differences inequality).

Let $X$, $\hat{X}$, $\mathcal{M}$, $\Gamma$ and $C(c)$ be as in Theorem 2.1. Let $f : \Lambda \to \mathbb{R}$ be a $d_c$ Lipschitz function (i.e.
$f(x) - f(y) \le d_c(x, y)$) for some $c \in \mathbb{R}^N_+$, then for any $\lambda \in \mathbb{R}$,

logE(e w W )>) < >m^mi < y-\\nwx) (2 6) 

In particular, this means that 

P (/(X) > Ef(X) +t),¥ (f(X) < Ef(X) -t)< exp L^^ j . (2.7) 

Corollary 2.2 (McDiarmid's inequality for Markov chains). Let $X := (X_1, \ldots, X_N)$ be a
(not necessarily time homogeneous) countable state Markov chain, taking values in state
space $\Lambda = \Lambda_1 \times \ldots \times \Lambda_N$, with mixing time $\tau(\epsilon)$.

Let $f : \Lambda \to \mathbb{R}$ be a $d_c$ Lipschitz function for some $c \in \mathbb{R}^N_+$, then for any $\lambda \in \mathbb{R}$,

$$\log \mathbb{E}\left(e^{\lambda(f(X) - \mathbb{E}f(X))}\right) \le \frac{\lambda^2 \cdot \|c\|^2 \cdot \tau_{\min}}{8}. \tag{2.8}$$

This implies that

$$\mathbb{P}(f(X) \ge \mathbb{E}f(X) + t),\ \mathbb{P}(f(X) \le \mathbb{E}f(X) - t) \le \exp\left(\frac{-2t^2}{\|c\|^2\, \tau_{\min}}\right). \tag{2.9}$$

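Corollary 2.3 (Hoeffding inequality). Let $X$, $\hat{X}$, $\mathcal{M}$, and $\Gamma$ be as in Theorem 2.1. Suppose that
$f_i : \Lambda_i \to [a_i, b_i]$, $i \le N$. Let $c_i := b_i - a_i$, and define $C(c)$ as in (2.5). Define $S := \sum_{i=1}^{N} f_i(X_i)$,
then

$$\mathbb{P}(S \ge \mathbb{E}S + t),\ \mathbb{P}(S \le \mathbb{E}S - t) \le \exp\left(\frac{-2t^2}{\|\Gamma \cdot C(c)\|^2}\right). \tag{2.10}$$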

Corollary 2.4 (Hoeffding inequality for Markov chains). First let $X = (X_1, \ldots, X_N)$ be
a (not necessarily time homogeneous) Markov chain taking values in some countable space
$\Lambda := \Lambda_1 \times \ldots \times \Lambda_N$. Suppose that $f_i : \Lambda_i \to [a_i, b_i]$, $i \le N$. Define $S := \sum_{i=1}^{N} f_i(X_i)$, then

$$\mathbb{P}(S \ge \mathbb{E}S + t),\ \mathbb{P}(S \le \mathbb{E}S - t) \le \exp\left(\frac{-2t^2}{\tau_{\min} \sum_{i=1}^{N} (b_i - a_i)^2}\right).$$



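Now suppose, in addition, that $X$ is time homogeneous, and $\Lambda_1 = \ldots = \Lambda_N = \Omega$. Suppose
that $f : \Omega \to [a, b]$. Let $t_0 \ge 0$ ("burn-in time"), and denote $Z := \frac{\sum_{i=t_0+1}^{N} f(X_i)}{N - t_0}$.
Then for every $t \ge 0$,

$$\mathbb{P}\left(Z \ge \mathbb{E}_\pi(f) + \frac{(b - a)\,\eta_{\min}(t_0)}{N - t_0} + t\right),\ \mathbb{P}\left(Z \le \mathbb{E}_\pi(f) - \frac{(b - a)\,\eta_{\min}(t_0)}{N - t_0} - t\right) \le \exp\left(\frac{-2(N - t_0)\,t^2}{(b - a)^2 \cdot \tau_{\min}}\right). \tag{2.11}$$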
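To see how the first bound of Corollary 2.4 behaves numerically, read it as $\exp\left(-2t^2 / (\tau_{\min} \sum_i (b_i - a_i)^2)\right)$ and bound $\tau_{\min} = \inf_\epsilon \tau(\epsilon)/(1-\epsilon)^2$ from above by a minimum over a finite grid of $\epsilon$ values. The sketch below does this for a two-state chain, where the total variation distance between the two rows of $P^t$ equals $|\lambda|^t$ with $\lambda$ the second eigenvalue. The function names, the grid, and the sample value of $\lambda$ are assumptions of this illustration.

```python
import math


def tau_two_state(eps, lam):
    """tau(eps) for a two-state chain with second eigenvalue 0 < lam < 1: smallest t >= 1 with lam**t <= eps."""
    return max(1, math.ceil(math.log(eps) / math.log(lam)))


def tau_min_upper_bound(tau, eps_grid):
    """Upper bound on tau_min via the minimum of tau(eps) / (1 - eps)^2 over a finite grid."""
    return min(tau(e) / (1.0 - e) ** 2 for e in eps_grid)


def hoeffding_markov_bound(t, ranges, tau_min):
    """exp(-2 t^2 / (tau_min * sum_i (b_i - a_i)^2)) for ranges given as (a_i, b_i) pairs."""
    total = sum((b - a) ** 2 for a, b in ranges)
    return math.exp(-2.0 * t * t / (tau_min * total))


lam = 0.2628  # hypothetical second eigenvalue of a two-state chain
grid = [k / 100 for k in range(1, 100)]
tm = tau_min_upper_bound(lambda e: tau_two_state(e, lam), grid)
bound = hoeffding_markov_bound(100.0, [(0.0, 1.0)] * 1200, tm)
print(tm, bound)
```

Setting `tau_min` to 1 recovers the classical Hoeffding bound for independent variables, so the ratio of the two exponents shows the price paid for dependence.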



For our second theorem, we will need to define the d 2 distance of two measures on A (as 
in Samson (2000), and Marton (2003)): 

Definition. Let P, Q be two measures on A, then their d 2 distance is 



d 2 (P,Q):= inf 

w(X~P,Y~Q) 



N 



1/2 



(2.12) 



J2'$2'ir[Xi¥:Vi\Yi = Vi] 2 -Q(v) 

yeA i=l 

inf sup E, (Va^llI^yA (2.13) 



where a : A — > is a vector valued function. 

Remark 2.1. The equivalence of these two equations, and the triangle inequality for d 2 
follows by Lemma B and Lemma A of Marton (2003). 
Theorem 2.2. Let $X$, $\hat{X}$, $\mathcal{M}$ and $\Gamma$ be as in Theorem 2.1. Let $(\gamma_{i,j})_{i,j \le n} := (\sqrt{\Gamma_{i,j}})_{i,j \le n}$.
Then for any distribution $Q$ on $\Lambda$,

$$d_2(P, Q) \le \|\gamma\| \sqrt{s(\hat{X}) \cdot 2D(Q \| P)}, \tag{2.14}$$

and

$$d_2(Q, P) \le \|\gamma\| \sqrt{s(\hat{X}) \cdot 2D(Q \| P)}. \tag{2.15}$$

Remark 2.2. This is a slight abuse of notation, because we also denote the spectral gap by
$\gamma$, but since they will never appear in the same formula, they are easy to distinguish.

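Corollary 2.5 (Talagrand's convex distance inequality). Let $X$, $\hat{X}$, $\mathcal{M}$ and $\gamma$ be as in
Theorem 2.2. Let $A \subset \Lambda$, and $d_T(x, A)$ be the Talagrand distance of $x \in \Lambda$ from $A$:

$$d_T(x, A) := \sup_{\alpha \in \mathbb{R}^N_+,\ \|\alpha\| \le 1}\ \min_{y \in A} d_\alpha(x, y). \tag{2.16}$$

Then

$$\mathbb{E}_P\left(\exp\left(\frac{1}{4 s(\hat{X}) \|\gamma\|^2}\, d_T^2(X, A)\right)\right) \le \frac{1}{P(A)}. \tag{2.17}$$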

For the following result, we will need to define $\alpha$-self-bounding functions (these are similar
to self-bounding functions, see Boucheron, Lugosi and Massart (2009)).

Definition 7. Let $\Omega = \Omega_1 \times \ldots \times \Omega_N$. Let $a, b \ge 0$.

1. We say that $f : \Omega \to \mathbb{R}$ is $\alpha$-$(a, b)$-self-bounding if there is $\alpha : \Lambda \to \mathbb{R}^N_+$ such that

(a) $f(x) - f(y) \le \sum_{i \le N} \alpha_i(x) \mathbb{1}[x_i \ne y_i]$ for every $x, y \in \Omega$.

(b) $\alpha_i(x) \le 1$ for every $i \le N$, $x \in \Omega$.

(c) $\sum_{i \le N} \alpha_i(x) \le a f(x) + b$.

2. We say that $f : \Omega \to \mathbb{R}$ is weakly $\alpha$-$(a, b)$-self-bounding if there is $\alpha : \Lambda \to \mathbb{R}^N_+$ such
that

(a) $f(x) - f(y) \le \sum_{i \le N} \alpha_i(x) \mathbb{1}[x_i \ne y_i]$ for every $x, y \in \Omega$.

(b) $\sum_{i \le N} \alpha_i(x)^2 \le a f(x) + b$.

Remark 2.3. It is easy to see that $\alpha$-$(a, b)$-self-bounding functions are also weakly $\alpha$-$(a, b)$-self-bounding.
It is also easy to see that these are special cases of $(a, b)$-self-bounding and
weakly $(a, b)$-self-bounding functions.

Theorem 2.3. Let $X$, $\hat{X}$, $\mathcal{M}$ and $\gamma$ be as in Theorem 2.2.

If $f : \Lambda \to \mathbb{R}$ is weakly $\alpha$-$(a, b)$-self-bounding, then for every $\lambda > 0$,

E ( exp ( A (/(*) - Ef(X)) - ^/(I) + ) < 1, (2.11 



E(exp(-A(/(X)-E/(X)))) < exp 



A 2 || 7 || 2 (aE/(X)+6) 



(2.19) 
thus for every $t \ge 0$,

$$\mathbb{P}(f(X) \ge \mathbb{E}f(X) + t) \le \exp\left(\frac{-t^2}{2\|\gamma\|^2 s(\hat{X})\,(a\,\mathbb{E}f(X) + b + at)}\right), \tag{2.20}$$

$$\mathbb{P}(f(X) \le \mathbb{E}f(X) - t) \le \exp\left(\frac{-t^2}{2\|\gamma\|^2 s(\hat{X})\,(a\,\mathbb{E}f(X) + b)}\right). \tag{2.21}$$

The following corollary is an improvement of Theorem 11.2 of Dubhashi and Panconesi
(2009) (the constant is 2 times better in the independent case):

Corollary 2.6 (Method of non-uniformly bounded differences). Let $X$, $\hat{X}$, $\mathcal{M}$ and $\gamma$ be as
in Theorem 2.2. Suppose that there are real valued functions $\alpha(x) := (\alpha_1(x), \ldots, \alpha_N(x))$ such
that $f : \Lambda \to \mathbb{R}$ satisfies, for every $x, y$,

$$f(x) \le f(y) + \sum_{i \le N} \alpha_i(x) \mathbb{1}[x_i \ne y_i] \tag{2.22}$$

or

$$f(x) \ge f(y) - \sum_{i \le N} \alpha_i(x) \mathbb{1}[x_i \ne y_i]. \tag{2.23}$$

Furthermore, suppose that there is a constant $C$ such that for every $x \in \Lambda$,

$$\sum_{i=1}^{N} \alpha_i(x)^2 \le C.$$

Then for every $t \ge 0$,

$$\mathbb{P}(f(X) - \mathbb{E}f(X) \ge t),\ \mathbb{P}(f(X) - \mathbb{E}f(X) \le -t) \le \exp\left(\frac{-t^2}{2\|\gamma\|^2 s(\hat{X})\, C}\right). \tag{2.24}$$

The following is similar to Corollary 4 of Samson (2000): 

Corollary 2.7 (Concentration for convex functions on a cube). Let $X$, $\hat{X}$, $\mathcal{M}$ and $\gamma$ be as
in Theorem 2.2. Additionally, suppose that $X_i$, $1 \le i \le N$, take values in $[0, 1]$. Suppose that
$f : [0, 1]^N \to \mathbb{R}$ is a 1-Euclidean Lipschitz, convex function. Then

P(/(X) -E/pO > f),P(/(X) -E/(X) < -t) < exp [ ~f ) . (2.25) 

Theorem 2.3 also implies concentration for supremum of positive valued empirical processes 
(similarly to Theorem 2 of Samson (2000)): 

Corollary 2.8 (Concentration for positive valued empirical processes). Let X , X , Ai and 7 

be as in Theorem 2.2. Let (f it j : Aj — > [0,C])i<M,j<N be a family of positive valued functions, 
bounded by C . 
Define 

Z(x) := sup Aj( x i) and Z ■= Z ( X )- (2-26) 



j<M 



i<N 
Then $Z(x)/C$ is $\alpha$-$(1, 0)$-self-bounding, and thus

$$\mathbb{P}(Z \ge \mathbb{E}Z + t) \le \exp\left(\frac{-t^2}{2\|\gamma\|^2 s(\hat{X})\, C\,(\mathbb{E}Z + t)}\right), \tag{2.27}$$

$$\mathbb{P}(Z \le \mathbb{E}Z - t) \le \exp\left(\frac{-t^2}{2\|\gamma\|^2 s(\hat{X})\, C\, \mathbb{E}Z}\right). \tag{2.28}$$

Remark 2.4. This formulation is analogous to the one in Massart (2000). The original 
formulation in the literature is 

Z(x) := sup y^f(xi) 
for some countable set J 7 , our version is more general. 

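Theorem 2.4 (Bernstein inequality). Let $X$, $\hat{X}$, $\mathcal{M}$ and $\gamma$ be as in Theorem 2.2.
Suppose that $f_i : \Lambda_i \to [-C, C]$, and let

$$V := \mathbb{E}\left(\sum_{i \le N} f_i(X_i)^2\right). \tag{2.29}$$

Let $S := \sum_{i \le N} f_i(X_i)$, then for every $0 \le \lambda < \frac{1}{2\sqrt{2}\,\|\gamma\|^2 s(\hat{X})\, C}$,

$$\log \mathbb{E}_P\left[\exp\left[\lambda(S - \mathbb{E}_P S)\right]\right] \le \frac{2\|\gamma\|^2 s(\hat{X})\, V \lambda^2}{1 - 2\sqrt{2}\,\|\gamma\|^2 s(\hat{X})\, C \lambda}, \tag{2.30}$$

thus for every $t \ge 0$,

$$\mathbb{P}(S \ge \mathbb{E}S + t),\ \mathbb{P}(S \le \mathbb{E}S - t) \le \exp\left(\frac{-t^2}{s(\hat{X}) \|\gamma\|^2 (8V + 4\sqrt{2}\, C \cdot t)}\right). \tag{2.31}$$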



Remark 2.5. We do not require that $\mathbb{E} f_i(X_i) = 0$ for $i \le N$.

Corollary 2.9 (Bernstein inequality for Markov chains). First let $X = (X_1, \ldots, X_N)$ be
a (not necessarily time homogeneous) Markov chain taking values in some countable space
$\Lambda := \Lambda_1 \times \ldots \times \Lambda_N$. Suppose that $f_i : \Lambda_i \to [-C, C]$, $i \le N$. Define $S := \sum_{i=1}^{N} f_i(X_i)$, and

$$V := \mathbb{E}\left(\sum_{i=1}^{N} f_i(X_i)^2\right). \tag{2.32}$$

Then for every $t \ge 0$,

$$\mathbb{P}(S \ge \mathbb{E}S + t),\ \mathbb{P}(S \le \mathbb{E}S - t) \le \exp\left(\frac{-t^2 / \tilde{\tau}_{\min}}{8V + 4\sqrt{2}\, C t}\right). \tag{2.33}$$
Now suppose, in addition, that $X$ is time homogeneous, irreducible, aperiodic, with
stationary distribution $\pi$, and $\Lambda_1 = \ldots = \Lambda_N = \Omega$. Suppose that $f : \Omega \to [-C, C]$. Let $t_0 \ge 0$
("burn-in time"), and $Z := \frac{\sum_{i=t_0+1}^{N} f(X_i)}{N - t_0}$. In this case,

N 



Then for every t > 0, 

P ( Z > E^f) + 2 ^ min p°) +t], f(z< EM - 2C ^ min p 0) - t] (2.35) 

< 



cxp 



N-to 
-t\N-t f/r' nm 



8V + AV2 ■ (N - t )Ct J ' 

The following result is similar to Theorem 3 of Samson (2000): 

Theorem 2.5 (Weak version of Talagrand's suprema of empirical processes inequality). Let 

X , X , Ai and 7 be as in Theorem 2.2. 

Let (fij : Aj — > [—C,C])i<M,j<N be a family of functions, bounded by C. 
Define 

Z(x) := sup y fij(xi), and Z := Z(X), (2.36) 
j<M rri, 

J — i<N 



define 
and let 

Then for every A > 0, 
thus for every t > 0, 



G(X) :=logEe A(z - EZ) , (2.37) 
W-El^max/^pQ 2 ] . (2.38) 

\i<N j ~ M J 



G(A),G(-A)< ^hg^L, (2.39) 
V ' V 7 " 1-2 V / 2||7|| 2 |A|C V ' 



[Z >EZ + t) ,F(Z <EZ -t) (2.40) 
-t 2 



< 



cxp 



s(X)||7|| 2 (16W + Ay/2C ■ t) 
The same inequalities hold with the definition 



Z(x) := sup ^ fi,j(xi 
Our next result is an extension of Theorem 3 and 3' of Marton (2003): 



(2.41) 
Theorem 2.6. Let $X$, $\hat{X}$, $\mathcal{M}$ and $\gamma$ be as in Theorem 2.2.
Suppose that f : A — > R satisfies one of the following: 

Condition 1. There are functions a.i : A — > R +; i < N , such that for any x,y E A, 

N 



f(x) - f(y) < ^2aii(x)l[xi ^ yi\. 



Condition 2. There are functions ctj : A — > R + , (3i : A — > R + , i < N , such that for any x,y E A, 

N 



f(x) - f(y) < Mx) + 0i(y)) l[ Xi ± y 'i 



i=l 



Let us denote 



F{\) := ^ e Kf{x)-mx)) ^ 
G{\) := logF(A), 



N 



V a := E^Ta 2 (X), 



i=l 
N 



Vp := E^/3 2 (X), 



g a {r) := logEe r ^=i a ? (x) , 
gs(r) := logEe^^W. 



If f satisfies Condition 1, then for A > 0, 



2A 2 || 7 1 12 
r>2A2j|7|| 2 r - 2A 2 ||7| 



G(A)< min ^ - " ' - g a (s(X)-r), (2.42) 



and 

G(-\) < 2s(A > )A 2 || 7 || V a . (2.43) 
// / satisfies Condition 2, then for A > 0, 

G < A > s r> ™, T -?xZr ( gM]l)T) + s(t)TV ') • (2M) 



and 



r>4A2||7|| 2 r - 4A 2 ||7| 

Remark 2.6. This result is quite powerful, since all of the previous inequalities follow from 
it (with slightly worse constants). 
2.2. Results by spectral methods

In this section, we give some concentration inequalities for empirical averages, using spectral
methods. For finite (or countable) state reversible chains, the sharp version of Hoeffding's
inequality has the following form (here we have adapted it to work for non-stationary initial
distributions too):

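Theorem 2.7 (Theorem 1 of Leon and Perron (2004)). Let $X = (X_1, \ldots, X_N)$ be a time
homogeneous, reversible, irreducible, aperiodic Markov chain taking values in some finite
state space $\Omega$, with stationary distribution $\pi$. Let $\lambda$ be the second largest eigenvalue of $P$
($\lambda = 1 - \gamma$), and $f : \Omega \to [a, b]$. Denote $S := \sum_{i=1}^{N} f(X_i)$, and let $\lambda_0 = \max(0, \lambda)$. Suppose
that $X_1 \sim \pi$, then

$$\mathbb{P}\left[\frac{S}{N} \ge \mathbb{E}_\pi f + t\right],\ \mathbb{P}\left[\frac{S}{N} \le \mathbb{E}_\pi f - t\right] \le \exp\left(-2\,\frac{1 - \lambda_0}{1 + \lambda_0}\, N t^2 / (b - a)^2\right). \tag{2.46}$$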
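The stationary bound of Theorem 2.7, read as $\exp\left(-2 \frac{1-\lambda_0}{1+\lambda_0} N t^2/(b-a)^2\right)$, differs from the classical Hoeffding bound only by the factor $(1-\lambda_0)/(1+\lambda_0)$ in the exponent. A short sketch of its evaluation (the function name and the sample parameters are chosen for illustration):

```python
import math


def leon_perron_bound(N, t, a, b, lam):
    """exp(-2 * (1 - lam0) / (1 + lam0) * N * t^2 / (b - a)^2) with lam0 = max(0, lam)."""
    lam0 = max(0.0, lam)
    return math.exp(-2.0 * (1.0 - lam0) / (1.0 + lam0) * N * t * t / (b - a) ** 2)


# Hypothetical example: 1200 steps, f taking values in [0, 1], deviation t = 0.05.
print(leon_perron_bound(1200, 0.05, 0.0, 1.0, 0.2628))
print(leon_perron_bound(1200, 0.05, 0.0, 1.0, 0.0))  # coincides with the i.i.d. Hoeffding bound
```

Negative values of $\lambda$ are truncated to zero, so chains with negative second eigenvalue get the same bound as independent samples.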



For arbitrary initial distribution, denote $Z := \frac{1}{N - t_0} \sum_{i=t_0+1}^{N} f(X_i)$, then we have
P [Z > E w f + t] , P [Z < E w f -t}< exp ( -2^^(N - t )t 2 /(b - a) 2 ) + inf e^w^J . 



1 + A 



0<e<l 



(2.47) 

Remark 2.7. The proof of (2.47) follows by the same argument that we use in the proof of 
Corollary 2.10. 

Now we present a Bernstein-type result for finite state reversible chains, which is based 
on the proof of Theorem 1.1. of Lezaud (1998a): 

Corollary 2.10 (Bernstein inequality for reversible Markov chains). Let $X = (X_1, \ldots, X_N)$
be a time homogeneous, reversible, irreducible, aperiodic Markov chain taking values in some
finite state space $\Omega$, with stationary distribution $\pi$, spectral gap $\gamma$, and mixing time $t_{mix}(\epsilon)$
for some $0 < \epsilon < 1$. Suppose that $f : \Omega \to [-C, C]$ with $\mathbb{E}_\pi f = 0$, and denote $V_f := \mathrm{Var}_\pi(f)$.



Let to > ( u burn-in time"), define Z 



12i = t + l f( X i) 

N-t 



, and let 



h(x) :=i(7rT^-(l-x/2)), 



(2.48) 



then for t > 0, 



F[Z - EJ >t] ,¥[Z - E n f < -t] 



< e 7//5 exp 

< e 7//5 exp 



(N - t )t 2 7 



AV f + 4h(5Ct/V f ) 
(iV-t )t 2 7 



inf 

0<e<l 



I '0 



AV f + 10C • t 



inf e 

0<e<l 



I — 



(2.49) 
(2.50) 



Define the asymptotic variance, a 2 , as 



a 2 := lim -Var«{f{X x ) 



f(X N )) , 



(2.51) 
then the following bounds hold:



F [Z - E n f > t] , P [Z - E n f < -t] < Jn£ e 



+e 7/5 exp 



/ 



■(N-to) 



er- 



ne 

7 



Aa 2 K'tC - ( a 2 



HC 

7 



2a 2 K' 



t 

C 



,(2.52) 



with K' :-- 



lOVf 



Remark 2.8. We got rid of $N_q$ in Theorem 1.1, and thus this form of the bound is more
useful for practical applications.

Theorem 3.3 of Lezaud (1998a) (see also Theorem 2.1 in Lezaud (1998b)) generalizes this
bound to non-reversible chains, with constants in the exponent depending on the spectral
gap of the multiplicative symmetrization $K := P^*P$, where $P^*$ is the adjoint of $P$ in $L^2(\pi)$.
The weakness of this approach is that the spectral gap of $K$ can be very small, or even zero,
and it is not necessarily related to the mixing time of the chain. We propose the following
improved version, which settles this difficulty:

Theorem 2.8 (Bernstein inequality for non-reversible Markov chains). Define the pseudo
spectral gap for an irreducible, aperiodic $P$ with stationary distribution $\pi$ as

$$\gamma_{ps} := \sup_{k \ge 1} \frac{\gamma\left((P^*)^k P^k\right)}{k}. \tag{2.53}$$



With the notations of Corollary 2.10, we have, for t > 0, 



F[Z -EJ >t] ,¥[Z -EJ < -t] 
(N - t )t 2 lps 



< exp 

< exp 



8V f + 8h(5Ct/V f ) 
(N - t )t 2 lps 



+ inf el 

0<e<l 



8V f + 20C ■ t 



+ inf eL 

0<e<l 



■ L "mw 



(2.54) 
(2.55) 



Remark 2.9. For $k \gg t_{mix}$, $P^k \approx \lim_{t \to \infty} P^t$, and $\gamma\left((\lim_{t \to \infty} P^t)^* \lim_{t \to \infty} P^t\right) = 1$, so $\gamma_{ps}$
cannot be much smaller than $1/t_{mix}$.



2.3. Results for continuous time chains 



Our next two results are based on Corollaries 2.4 and 2.9, and show concentration inequalities
for (not necessarily reversible) continuous time Markov chains (with countable state space).
The proofs of these are left to the reader as an exercise (they can be done using the same
technique as in the proof of Theorem 3.4 on page 858 of Lezaud (1998a)).

Corollary 2.11 (Hoeffding inequality for continuous time Markov chains). Let $(X_s)_{s \ge 0}$ be
a time homogeneous, continuous time Markov chain taking values in some countable space
$\Omega$, with stationary distribution $\pi$, and mixing time $t_{mix}^{cont}$. Let $f : \Omega \to [a, b]$. Let $t_0 \ge 0$ ("burn-in
time"), denote

$$Z := \frac{1}{T - t_0} \int_{s=t_0}^{T} f(X_s)\,ds. \tag{2.56}$$

Then for every t > 0, 

P ( Z > EM) + + *V P f Z < E,(/) - - ^(2.57) 

< 



cxp 



T-to 
-2{T-t )r 
(b - a)\ 



cont 
min 



Corollary 2.12 (Bernstein inequality for continuous time Markov chains). Let (X s ) s > be 
as in Corollary 2.11. Let f : O — > [—C,C], and Z as in (2.56). 
Denote 



V:=E^ f(X s ) 2 ds^j . (2.58) 



Then for every t > 0, 



< ex P ( zfSjz^im. \ 

~ \8V + AV2(T -t )CtJ 

As a comparison, the following theorem is the main result of Lezaud (2001): 

Theorem 2.9 (Theorem 1.1. of Lezaud (2001)). Let P t be an ergodic Markov semigroup 
with invariant probability measure n. Assume that its infinitesimal generator L has as simple 
isolated eigenvalue A = and that the initial distribution q has a L 2 (tt) density relatively to 
the measure n. Then, for all f G D 2 (L) such that ir(f) = 0, ||/||oo — a , for all t > and 
T > 0, 

2Tt 2 

^(T^St >t)<N q exp { ^ } , (2.60) 



a 2 (l + y/l + 4at/(7(T 2 ) 



with $S_T := \int_0^T f(X_s)\,ds$, $\sigma^2 := \lim_{T \to \infty} T^{-1} \mathrm{Var}_\pi(S_T)$, $\gamma$ is the spectral gap of $(L + L^*)/2$, and
$N_q$ is the $L^2(\pi)$ norm of the density of $q$ relative to the stationary distribution $\pi$.



3. Applications 
3.1. Coin tossing 

The reader might think that independent Bernoulli trials are a good model for coin tossing.

We disagree. In a famous paper, Diaconis, Holmes and Montgomery (2007), it was shown
that it is slightly more likely for the coin to come up on the same side as it was at the
beginning.
The claimed 2% difference would be hard to notice in practice, unless one makes very long
trials. However, it becomes much more evident if we use a relatively large coin, and
throw it not too vigorously: instead of flipping it with the thumb, just throw it upwards
using the palm.

We have made our experiments with a Singapore 50 cent coin, and tossed it up 40-50 cm
high. Our results for 1200 coin tosses (1 corresponds to heads, and 0 to tails):

1001001011110011111100111111111000001101000011111111101011111000000101000001 
1001100000010111100001100010000001111110000111111111001111000000010011100000 
0000111000100001111100010101010000110011011001000111110000000011111011000000 
0110100000010101000110111111001000000011010001001111101010111101101111001111 
0001100000000000010011000001111101000000110011011111011000000011111110011010 
1101101000010000010111111111110100111111100011000000011111000011110100100010 
1100110101100111001011011100111000001111110010011100111011111110011101100000 
1111100101000110010111111101110010011111111110000000000001000000001011010011 
0000010111110000011000001101110010111111100101111100100111111011011100011111 
1100101110000011111101111100011111000001011001111110001111001111111110000000 
0001111000000110111111111011110000111001111010001111110011000111111110000001 
0010100010010011110000000100110111011100000010001111111000000110001100011110 
0111100111011111110010101000011011011001111111100000011111111101000000001110 
1100011101110101111100010011011111111111111000111000011110110001111000111000 
1111011111110001001001101011000001001000010101001011100000111111000110110101 
011101101011010100100101101111111011011010011101111111111111 

We have used this sequence to estimate the variance of

$$S_n := X_1 + \ldots + X_n.$$

For $X_1, \ldots, X_n$ i.i.d. Bernoulli variables, $\mathrm{Var}\, S_n = n/4$, so fixing $n = 40$, the variance should
be 10. Breaking our experimental result into 30 slices of length 40, and denoting the sums of
each slice by $S^{(1)}, \ldots, S^{(30)}$, computing the sample mean $m_s := (\sum_{i \le 30} S^{(i)})/30$ and variance
$V_s := (\sum_{i=1}^{30} (S^{(i)} - m_s)^2)/29$ of these slices, we get

$$m_s = 0.5333 \quad \text{and} \quad V_s = 16.5747. \tag{3.1}$$
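The slicing procedure described above can be sketched as follows. The function names and the short sample string are assumptions of this illustration; applying `slice_sums` with slice length 40 to the full 1200-digit record would give the 30 slice sums used in the text.

```python
def slice_sums(bits, slice_len):
    """Sums of consecutive, non-overlapping slices of a string of 0/1 characters."""
    if len(bits) % slice_len != 0:
        raise ValueError("sequence length must be a multiple of slice_len")
    return [
        sum(int(ch) for ch in bits[k:k + slice_len])
        for k in range(0, len(bits), slice_len)
    ]


def sample_mean_and_variance(values):
    """Sample mean and unbiased sample variance (denominator len(values) - 1)."""
    m = sum(values) / len(values)
    v = sum((s - m) ** 2 for s in values) / (len(values) - 1)
    return m, v


# Toy input: two slices of length 4 with sums 3 and 1.
sums = slice_sums("11010010", 4)
print(sums, sample_mean_and_variance(sums))
```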

This is very different from what we expect from i.i.d. Bernoulli trials. Physically, it is clear 
what went wrong: the height is too low, so the coin only turns a few times at each toss, and 
it is more likely to end up on the same side where it started. We can model this with a 2 
state Markov chain with probability transition matrix 

$$P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix}.$$

We estimate these probabilities by the counts of 00, 01, 10, 11, which we denote by #00,
etc. Thus we get the estimate

$$P \approx \begin{pmatrix} \frac{\#00}{\#00 + \#01} & \frac{\#01}{\#00 + \#01} \\ \frac{\#10}{\#10 + \#11} & \frac{\#11}{\#10 + \#11} \end{pmatrix} \approx \begin{pmatrix} 0.6071 & 0.3929 \\ 0.3443 & 0.6557 \end{pmatrix}.$$
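The count-based estimate of the transition matrix can be sketched as below. The function name is hypothetical, and counting overlapping consecutive pairs is an assumption about how the counts of 00, 01, 10, 11 are taken.

```python
def estimate_transition_matrix(bits):
    """Estimate a 2x2 transition matrix from a 0/1 string via counts of consecutive pairs."""
    counts = {"00": 0, "01": 0, "10": 0, "11": 0}
    for prev, cur in zip(bits, bits[1:]):
        counts[prev + cur] += 1
    rows = []
    for s in "01":
        total = counts[s + "0"] + counts[s + "1"]
        if total == 0:
            raise ValueError("state " + s + " never occurs as a predecessor")
        rows.append([counts[s + "0"] / total, counts[s + "1"] / total])
    return rows


# Toy input with pair counts 00: 1, 01: 2, 10: 2, 11: 1.
print(estimate_transition_matrix("0011010"))
```

Each row is normalized by the number of times the corresponding state is followed by another toss, so the rows sum to one by construction.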

Let $X_1, \ldots, X_{1200}$ be a Markov chain with this transition matrix, started from 1. Since we
do not have a closed formula for $V_s$ for this model, we just ran a long computer simulation
(100000 runs, 1200 steps in each), which gave us that the expected value of this variance
is approximately 16.8, which is very close to what we have observed. This means that the
Markov chain model describes the real situation better than i.i.d. Bernoulli trials.
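A simulation of this kind can be sketched as follows. This is a small-scale illustration under stated assumptions (300 runs instead of 100000, a fixed seed, and hypothetical function names), not the code used for the reported figure.

```python
import random


def simulate_chain(P, n_steps, start, rng):
    """Simulate n_steps states of a two-state chain with transition matrix P, starting at `start`."""
    x = start
    path = [x]
    for _ in range(n_steps - 1):
        x = 1 if rng.random() < P[x][1] else 0
        path.append(x)
    return path


def slice_variance(path, slice_len):
    """Unbiased sample variance of the sums of consecutive slices of length slice_len."""
    sums = [sum(path[k:k + slice_len]) for k in range(0, len(path), slice_len)]
    m = sum(sums) / len(sums)
    return sum((s - m) ** 2 for s in sums) / (len(sums) - 1)


def mean_slice_variance(P, runs, n_steps=1200, slice_len=40, seed=0):
    """Average of the slice variance over independent simulated runs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        total += slice_variance(simulate_chain(P, n_steps, 1, rng), slice_len)
    return total / runs


P_hat = [[0.6071, 0.3929], [0.3443, 0.6557]]
estimate = mean_slice_variance(P_hat, runs=300)
print(estimate)
```

With more runs the average stabilizes; the number of runs trades running time against Monte Carlo error.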

For such a Markov chain, Corollary 2.4, or Theorem 2.8 can be applied to bound the 
deviation probabilities of S n from its expected value. 

The reader should not think of this as an isolated example; Markov chain models have been
successfully applied to many real life situations. For an example about basketball gambling
which outperformed the models of the bookmakers, see Kvam and Sokol (2006).



3.2. Error analysis for MCMC 

MCMC methods have a huge literature. They are used, amongst other things, for simulating
distributions arising from statistical physics, approximate counting in combinatorial
structures, approximate integration, stochastic optimization (simulated annealing), etc. Our
favourite review papers are Diaconis (2009) and Jerrum and Sinclair (1996).

Let $X_1, \ldots, X_N$ be a time homogeneous Markov chain, taking values in $\Lambda = \Omega^N$, with
stationary distribution $\pi$. We may be interested in computing the expectation $\mathbb{E}_\pi f$ for some
function $f : \Omega \to \mathbb{R}$. This can be approximated by the average

$$\frac{f(X_1) + \ldots + f(X_N)}{N}.$$

A natural question to ask is how large N should be so that this approximation is good, i.e. 
how long should we run our simulation. 
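As a small illustration of such an average, the following Python sketch simulates the 2-state chain of Section 3.1 and compares the empirical average of $f(x) = x$ with the exact stationary expectation. The function names, the seed and the run length are assumptions chosen for the example; the sketch says nothing about how large N must be in general.

```python
import random


def simulate_two_state_chain(transition, start, steps, rng):
    """Yield `steps` states of a 2-state Markov chain started from `start`."""
    state = start
    for _ in range(steps):
        yield state
        # Move to state 1 with probability transition[state][1].
        state = 1 if rng.random() < transition[state][1] else 0


def ergodic_average(f, chain):
    """Return (f(X_1) + ... + f(X_N)) / N for the states in `chain`."""
    total, count = 0.0, 0
    for x in chain:
        total += f(x)
        count += 1
    return total / count


# Transition matrix estimated in Section 3.1; seed and N are arbitrary choices.
P = [[0.6071, 0.3929], [0.3443, 0.6557]]
rng = random.Random(0)
estimate = ergodic_average(lambda x: x, simulate_two_state_chain(P, 1, 200_000, rng))
# Stationary probability of state 1 for a 2-state chain: p01 / (p01 + p10).
exact = P[0][1] / (P[0][1] + P[1][0])
```

For a 2-state chain the stationary distribution is available in closed form, which is what makes the comparison possible here; in a typical MCMC application `exact` is the unknown quantity.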

This problem has been extensively studied in Lezaud (1998b) for reversible Markov chains
with finite/general state space, and reversible Markov processes with finite/general state
space, with Bernstein-type results proven in all situations that are roughly $1/\gamma$ times weaker
than in the independent case. However, these results involve a constant $N_q$ that depends on
the initial state of the chain, which may be difficult to compute in practice, and the results
for non-reversible chains are not satisfactory.

Given the mixing time $t_{mix}(\epsilon)$ and the spectral gap $\gamma$ of the chain, Corollary 2.4, Corollary
2.9, and Corollary 2.10 give bounds on the deviation of this average from $\mathbb{E}_\pi f$.

Finding bounds on the spectral gap and the mixing time has a large literature; we refer
the reader to Levin, Peres and Wilmer (2009), and Lovasz and Winkler (1998).

One example of such a bound is Theorem 3 in Bubley et al. (1997): under the Dobrushin
uniqueness condition, i.e. if the maximum column sum of the Dobrushin matrix is $\alpha < 1$,
the Gibbs sampler Markov chain of a statistical physical model has mixing time

\[ t_{mix}(\epsilon) \le \left\lceil n \log(n \epsilon^{-1}) / (1 - \alpha) \right\rceil. \qquad (3.2) \]

This bound works for many statistical physical models (Curie-Weiss, Ising, Potts, etc.) at
sufficiently high temperature.
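To get a feeling for the size of this bound, it can be evaluated numerically. The sketch below assumes that (3.2) reads $t_{mix}(\epsilon) \le \lceil n \log(n/\epsilon)/(1-\alpha) \rceil$ and that n is the number of sites of the model; both the function name and the sample parameters are hypothetical.

```python
import math


def dobrushin_mixing_time_bound(n, epsilon, alpha):
    """Upper bound ceil(n * log(n / epsilon) / (1 - alpha)) on t_mix(epsilon).

    n: number of sites (assumed meaning), epsilon: accuracy in (0, 1),
    alpha: maximum column sum of the Dobrushin matrix, assumed < 1.
    """
    if not 0 <= alpha < 1:
        raise ValueError("the bound requires 0 <= alpha < 1")
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must lie in (0, 1)")
    return math.ceil(n * math.log(n / epsilon) / (1 - alpha))


bound = dobrushin_mixing_time_bound(n=100, epsilon=0.01, alpha=0.5)
```

The bound grows like $n \log n$ for fixed accuracy and blows up as $\alpha$ approaches 1, which matches the restriction to sufficiently high temperature.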

For a bound on the coefficients of the Dobrushin matrix, see Chatterjee (2005), page 79, 
Lemma 4.4. 

For more examples and simulation results, we refer the reader to Gyori and Paulin (2012). 



3.3. m-dependence



We say that $X_1, \ldots, X_N$ are m-dependent random variables if for each $1 \le i \le N - m$,
$(X_1, \ldots, X_i)$ and $(X_{i+m}, \ldots, X_N)$ are independent.

For this dependence structure, we can define $n := \lceil N/m \rceil$,

\[ \hat{X}_1 := (X_1, \ldots, X_m), \; \ldots, \; \hat{X}_n := (X_{(n-1)m+1}, \ldots, X_N). \]

For $\hat{X}$, we construct a Marton coupling $\mathcal{M}$:

\[ \mathcal{M}^i \left[ \hat{X}_{\ge i+1} \sim \hat{P}_{\ge i+1}(\cdot \,|\, \hat{x}_{\le i}), \; \hat{X}'_{\ge i+1} \sim \hat{P}_{\ge i+1}(\cdot \,|\, \hat{x}_{\le i-1}, \hat{x}'_i) \right] \]

is constructed by first defining $\hat{X}_{\ge i+2} = \hat{X}'_{\ge i+2}$, with distribution $\hat{P}_{\ge i+2}$ (here we use the
m-dependence condition), and then defining $\hat{X}_{i+1}$ and $\hat{X}'_{i+1}$ conditionally on these two. Therefore,
it is clear that the mixing matrix for $\mathcal{M}$ satisfies



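\[ \Gamma \le \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 \\ 0 & 1 & 1 & \ddots & \vdots \\ \vdots & & \ddots & \ddots & 0 \\ & & & 1 & 1 \\ 0 & \cdots & & 0 & 1 \end{pmatrix}, \qquad (3.3) \]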



so we can see that ||r|| < 2 and 7 < 2, so our theorems hold under this condition, with 
s(X) = m. Thus the constants in the exponents are about 4m times worse than in the 
independent case. 
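The block partition used in this section is easy to write down explicitly. The Python sketch below splits a sequence of length N into $\lceil N/m \rceil$ consecutive blocks of length m (the last one possibly shorter); the helper name `partition_into_blocks` is an assumption for the example.

```python
import math


def partition_into_blocks(xs, m):
    """Split xs into consecutive blocks of length m (the last may be shorter)."""
    if m < 1:
        raise ValueError("block length m must be at least 1")
    return [xs[start:start + m] for start in range(0, len(xs), m)]


blocks = partition_into_blocks(list(range(1, 8)), 3)  # N = 7, m = 3
largest_block = max(len(block) for block in blocks)
```

Here `largest_block` is the size of the largest block of the partition, which equals m whenever N is at least m.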

We finish this section with the following "metatheorem" : 

Metatheorem 3.1. Suppose that $X_1, \ldots, X_N$ are dependent random variables that can be
put in a sequence with a typical range of dependence m. Then the concentration inequalities
hold with constants cm times weaker than in the independent case, for some constant c
(independent of N, m).

Proof. Define $\hat{X}$ as in the m-dependent case, and then construct the Marton coupling $\mathcal{M}$
for $\hat{X}$. □



3.4. Independent permutations and Bernoulli variables with fixed sum

Let $X_1, \ldots, X_n$ be

1. Uniformly chosen random permutations of $1, \ldots, n$, or

2. 0, 1 valued random variables with $\sum_{i=1}^n X_i = k$ for some $0 \le k \le n$, and uniformly
distributed among the possibilities.

Then

\[ \mathcal{M}^i \left[ X_{\ge i+1} \sim P_{\ge i+1}(\cdot \,|\, x_{\le i}), \; X'_{\ge i+1} \sim P_{\ge i+1}(\cdot \,|\, x_{\le i-1}, x'_i) \right] \]

is constructed by letting $I$ be distributed uniformly on $i+1, \ldots, n$, choosing $X_I = x'_i$, $X'_I = x_i$,
and then setting the rest of $X_{\ge i+1}$ and $X'_{\ge i+1}$ the same.
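For such a coupling, we have the coupling matrix

\[ \Gamma = (\Gamma_{i,j})_{i,j \le n} \le \begin{pmatrix} 1 & \frac{1}{n-1} & \frac{1}{n-1} & \frac{1}{n-1} & \cdots \\ 0 & 1 & \frac{1}{n-2} & \frac{1}{n-2} & \cdots \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & & 0 & 1 \end{pmatrix}. \qquad (3.4) \]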

This means that for 1-Hamming Lipschitz functions, i.e. c weighted Hamming Lipschitz 
functions with c = (1, . . . , 1), we have ||T • c|| 2 = 4, so by Theorem 2.1, the Mcdiarmid and 
Hoeffding inequalities hold with constant 4 times worse than in the independent case. 

For permutations, a much stronger result, Talagrand's convex distance inequality, with
constant 4 times weaker than in the independent case, was proven in Section 5 of Talagrand
(1995). This was further developed in McDiarmid (2002). For an overview, see Section 8.2
of Ledoux (2001). See also Chatterjee (2007) for a concentration inequality in the setting of
the combinatorial central limit theorem.

Unfortunately, ||r|| ~ ^\og(n) and ||7|| ~ y/n, so we can not recover these results. 



3.5. Hidden Markov chains 

Concentration inequalities for Hidden Markov chains have been investigated in Kontorovich 
(2006), see also Kontorovich (2007), Section 4.1.4. 

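Let $\tilde{X}_1, \ldots, \tilde{X}_N$ be a Markov chain (not necessarily homogeneous) taking values in a
discrete set $\tilde{\Lambda} = \tilde{\Lambda}_1 \times \ldots \times \tilde{\Lambda}_N$, with distribution $\tilde{P}$.

Let $X_1, \ldots, X_N$ be random variables taking values in the discrete space $\Lambda = \Lambda_1 \times \ldots \times \Lambda_N$
such that the joint distribution of $(\tilde{X}, X)$ is given by

\[ H(\tilde{x}, x) = \tilde{P}(\tilde{x}) \cdot \prod_{i=1}^N P_i(x_i \,|\, \tilde{x}_i), \]

i.e. $X_i$ are conditionally independent given $\tilde{X}$. Then we call $X_1, \ldots, X_N$ a hidden Markov
chain.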

The following result (an extension of Theorem 4.1.4 of Kontorovich (2007) to our setting) 
shows that the concentration properties of a hidden Markov chain are completely determined 
by the concentration properties of the underlying chain. 

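Proposition 3.1. Let

\[ \hat{\tilde{X}} := (\hat{\tilde{X}}_1, \ldots, \hat{\tilde{X}}_n) := \left( (\tilde{X}_1, \ldots, \tilde{X}_{i_1}), (\tilde{X}_{i_1+1}, \ldots, \tilde{X}_{i_2}), \ldots, (\tilde{X}_{i_{n-1}+1}, \ldots, \tilde{X}_N) \right), \]

\[ \hat{X} := (\hat{X}_1, \ldots, \hat{X}_n) := \left( (X_1, \ldots, X_{i_1}), (X_{i_1+1}, \ldots, X_{i_2}), \ldots, (X_{i_{n-1}+1}, \ldots, X_N) \right) \]

be partitions of $\tilde{X}$ and $X$. Suppose that $\tilde{\mathcal{M}}$ is a Marton coupling for $\hat{\tilde{X}}$, with mixing matrix
$\tilde{\Gamma}$, then there is a Marton coupling $\mathcal{M}$ for $\hat{X}$ with mixing matrix $\Gamma \le \tilde{\Gamma}$ (in each element).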

Proof. Suppose first that X = X, then n = N and s(X) = 1 (the general case is similar). 
We are given 



m [x% i+1 ~ p^ +1 (-fe),!'> i+1 ~ 4" l+ i(-i^-i^r 

and need to construct 

M l [X£ i+1 ~ P^ +1 (-\x< t ),X'l l+1 ~ P^i-lx^xZ 
This can be done by first defining a coupling 

tt X'>, +1 ) ~ x'J, ~ H(.\x<i, XZ i+1 ),X% i+1 



~.n 



satisfying that given -X> i+1 and X' >i+1 , (X i+1 , X' i+1 ) . . . , (X n , X' n ) are independent, with 
(Xj,Xj) distributed as the maximal coupling of the distributions Pj(-\Xj) and Pj(-\Xj). 

One can see that, by the Markov property, marginal distribution of X™ i+1 and X'> i+1 only 
depends on x» and x\ and does not depends on x<i-\. 

Therefore, we can construct M 1 by first defining (JQ, X!) as the maximal coupling of 
Pi{'\ x <i) an d Pi{'\ x <i-ii x 'di an d then defining the rest of it as in tt, given Xi,X[. Finally, it 
is easy to check that r < T. 

Note that the Markov property is necessary for this proof to work, see Kontorovich (2007), 
page 36 for a counterexample when X is not Markov. 

□ 



3.6. Random walks on weighted graphs 

We adapt the notations of Gillman (1998). Let $G = (V, E)$ be a connected undirected graph,
with each edge $\{x, y\}$ in $E$ having weight $w_{xy}$. Let $w_x := \sum_{y : \{x,y\} \in E} w_{xy}$ be the weight of $x$.

Then a random walk on G is equivalent to a time reversible Markov chain with transition
matrix $T := (p_{xy})_{x,y \in V}$, with

\[ p_{xy} = \begin{cases} \frac{w_{xy}}{w_x} & \text{if } \{x, y\} \in E, \\ 0 & \text{if not.} \end{cases} \]

We denote the eigenvalues of $T$ by $1 = \lambda_1 > \lambda_2 \ge \ldots \ge \lambda_{|V|}$. We denote by $\gamma := 1 - \lambda_2$
the eigenvalue gap; this is strictly positive for connected graphs. The stationary distribution
of the walk is denoted by

\[ \pi(x) = \frac{w_x}{\sum_{y \in V} w_y}. \]
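The definitions of this section can be checked on a small example. The Python sketch below builds the transition probabilities $p_{xy} = w_{xy}/w_x$ and the distribution $\pi(x) = w_x / \sum_y w_y$ from a dictionary of edge weights, using exact rational arithmetic; the function names and the three-vertex path graph are hypothetical, and the sketch assumes a graph without self-loops.

```python
from fractions import Fraction


def random_walk(weights):
    """Build the transition probabilities and stationary distribution.

    weights maps an undirected edge (x, y), x != y, to its weight w_xy.
    Returns (p, pi) with p[x][y] = w_xy / w_x and pi[x] = w_x / sum_y w_y.
    """
    vertex_weight = {}
    p = {}
    for (x, y), w in weights.items():
        for a, b in ((x, y), (y, x)):
            vertex_weight[a] = vertex_weight.get(a, 0) + Fraction(w)
            p.setdefault(a, {})[b] = Fraction(w)
    for x in p:
        for y in p[x]:
            p[x][y] /= vertex_weight[x]
    total = sum(vertex_weight.values())
    pi = {x: w / total for x, w in vertex_weight.items()}
    return p, pi


def step_distribution(p, mu):
    """Distribution after one step of the walk started from mu."""
    out = {x: Fraction(0) for x in p}
    for x, mass in mu.items():
        for y, pxy in p[x].items():
            out[y] += mass * pxy
    return out


# Hypothetical path graph a - b - c with edge weights 1 and 2.
p, pi = random_walk({("a", "b"): 1, ("b", "c"): 2})
```

Applying `step_distribution(p, pi)` returns `pi` again, which is the stationarity property; reversibility follows from $\pi(x) p_{xy} = w_{xy} / \sum_z w_z$ being symmetric in x and y.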

The main result of Gillman (1998) is the following theorem (see also Kahale (1997) for a 
sharper version): 
Theorem 3.1 (Theorem 2.1 of Gillman (1998)). Consider the random walk on a weighted
graph $G = (V, E)$ with initial distribution $q$. Let $A \subset V$. Let $A_n$ be the number of visits to $A$
in $n$ steps. Let $q/\sqrt{\pi}$ denote the vector $(q/\sqrt{\pi})(x) = q(x)/\sqrt{\pi(x)}$, and $N_q = \|q/\sqrt{\pi}\|_2$.

For any $t > 0$,

\[ \mathbb{P}(A_n - n\pi(A) \ge t) \le \left( 1 + \frac{t\gamma}{10n} \right) N_q \, e^{-t^2 \gamma / (20n)}, \qquad (3.5) \]

and the same bound holds for the lower tail.

Remark 3.1. A similar bound can be deduced from Theorem 1.1. of Lezaud (1998a), and 
thus by Corollary 2.10. See also Theorem 2.1. 

This theorem is very useful for probability amplification; here we briefly review Dubhashi and Panconesi
(2009), Section 3.5.3.

Suppose that one has a randomized algorithm computing a function $f : \{0,1\}^n \to \{0,1\}$,
which takes in $x \in \{0,1\}^n$ and an n long sequence of random bits r, and gives the result
$A(x, r)$. Suppose that the probability of correct evaluation is bounded away from 0, let us say

\[ \mathbb{P}(A(x, r) = f(x)) \ge 3/4. \]

Then the goal of probability amplification is to increase this success probability. This can
be done by running the algorithm k times for k independent r and taking the majority answer; the chance
that at least half of the results are incorrect is, by Hoeffding's inequality for independent variables,
smaller than $e^{-k/8}$.

Such a direct method would use nk random bits. 
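The numbers in this direct method are easy to tabulate. Each run is correct with probability at least 3/4, so at least half of the k runs being wrong means the fraction of correct runs falls 1/4 below its mean, and Hoeffding's inequality gives $\exp(-2k(1/4)^2) = e^{-k/8}$. The Python sketch below evaluates this bound and the resulting bit count; the function names and the sample values are assumptions for illustration.

```python
import math


def hoeffding_failure_bound(k):
    """Bound exp(-k/8) on the chance that at least half of k runs are wrong."""
    return math.exp(-k / 8)


def repetitions_needed(delta):
    """Smallest k with exp(-k/8) <= delta."""
    return math.ceil(8 * math.log(1 / delta))


def independent_bits(n, k):
    """Random bits used by k independent runs with n bits each."""
    return n * k


k = repetitions_needed(0.01)
```

With failure probability 0.01 this gives k = 37 repetitions, so an algorithm using n = 64 random bits per run would consume 64 · 37 bits in total.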

We can get this result using fewer bits, in the following way. First, let $G = (\{0,1\}^n, E)$
be a connected d-regular undirected graph, more precisely, an expander graph (the reader
can consult Hoory, Linial and Wigderson (2006), Tao (2010), and Kowalski (2011) for a
review). These graphs were popularized by the papers Ajtai, Komlos and Szemeredi (1983)
and Ajtai, Komlos and Szemeredi (1987).

For our purposes, it suffices that there exists graphs that the eigenvalue gap of the asso- 
ciated "uniformly" weighted random walk is roughly Let G be such a graph. 

We take r\ to be uniformly distributed in {0, 1}" (which is the stationary distribution 
7r), and r 2 , . . . ,ri be a random walk on G. Then using Theorem 3.1, we can prove that the 
chance of having less than half of A(x,ri), . . . ,A(x, r{) be correctly evaluated, is less than 
e -cie f or some universal constant c. Since generating r 1; . . . , r; only takes n + / log 2 d bits, 
and e > 4^, we can see that choosing / = \/dk/c gives the same precision as k independent 

n long sequences, and takes considerably less, n + k\og 2 dy/d/c random bits. 
A natural question: how is this setting related to our theorems?

Let us denote the mixing time of the random walk on G by $t_{mix}$, then we can write
$A_n = \sum_{i=1}^n 1[X_i \in A]$ and use Corollary 2.4 to get that the concentration of $A_n$ around its
mean is about $t_{mix}$ times worse than in the independent case.

It is easy to see that in k steps, we can visit fewer than $d^k$ vertices, and will be far away from
the uniform stationary distribution in total variation distance unless $k \ge \log_d |V| = n \log_d 2$.
Therefore $t_{mix} \ge n \log_d 2$. This means that Corollary 2.4, in this case, is much weaker than
Theorem 3.1 and Corollary 2.10.

However, we seriously doubt that such an inequality could hold for more general functions,
for example sums of the form $\sum_{i=1}^n f_i(X_i)$, or Hamming-Lipschitz functions.

Therefore we consider this as an example of "superconcentration": for empirical sums of
the form $\sum_{i=1}^n f(X_i)$, much stronger concentration occurs than for arbitrary functions.

Beyond probability amplification, this technique can also be used to evaluate the expectation
of any real valued function $f : \{0,1\}^n \to [a, b]$ by approximating it by

\[ \mathbb{E} f(X) \approx \frac{1}{N} \sum_{i=1}^N f(X_i), \qquad (3.6) \]

$X$ and $X_1$ being uniformly distributed in $\{0,1\}^n$, and $\{X_i\}_{2 \le i \le N}$ being the random walk on
the expander G. This approximation uses only $n + N \log_2 d$ random bits (instead of $Nn$ bits
by the independent sampling), and its precision can be estimated by Theorems 2.7 and 2.10
(the constants are roughly $\sqrt{d}$ times worse than in the independent case).



4. Open problems 

Since most of our concentration results are roughly $t_{mix}$ times weaker than in the independent
case, the following questions naturally arise:

1. (Talagrand's suprema of empirical processes) It would be interesting to prove the com- 
plete version of Talagrand's suprema for empirical processes inequality, i.e. improve 
Theorem 2.5 by replacing W with 




One could try to further bound V with max,,< A /E (J2i<N Aj(^) 2 ) an( ^ 

E (maxj<A/ |Si<zvAi(-^») )> as ^ * s done in the independent case (the proof for this 

result on page 141 of Ledoux (2001), and page 112 of Ledoux and Talagrand (1991), 

breaks down in the case of dependence). For an elegant proof of these results in the 

independent case using the entropy method, see 169-170 of Massart (2007). 

The reader could approach this problem by further developing Theorem 2.6, or by 

adapting Talagrand's q point method to the dependent case, see Talagrand (1996), 

Dembo (1997). 

2. (Unbounded random variables) Lemma 5.5 of Vershynin (2010) shows that three natural
definitions of subgaussian random variables (tail bound, moment bound, subexponential
moment) are in fact equivalent.

Definition 5.7 of Vershynin (2010) defines the $\psi_2$ norm of a real valued random variable
$X$ as

\[ \|X\|_{\psi_2} = \sup_{p \ge 1} p^{-1/2} \left( \mathbb{E}|X|^p \right)^{1/p}. \qquad (4.2) \]

For bounded variables, we have $\|X\|_{\psi_2} \le \|X\|_\infty$.
Proposition 5.10 of Vershynin (2010) gives a Chernoff-Hoeffding type inequality:

Proposition 4.1. Let $X_1, \ldots, X_N$ be independent, centered, subgaussian random variables,
and let $K := \max_i \|X_i\|_{\psi_2}$. Then for every $a = (a_1, \ldots, a_N) \in \mathbb{R}^N$ and every
$t > 0$, we have

\[ \mathbb{P}\left( \left| \sum_{i=1}^N a_i X_i \right| \ge t \right) \le e \cdot \exp\left( -\frac{c t^2}{K^2 \cdot \sum_{i \le N} a_i^2} \right), \qquad (4.3) \]

where $c > 0$ is an absolute constant.



Theorem A. 7.1 of Talagrand (2011) is the following unbounded version of Bernstein's 
inequality: 

Theorem 4.1. Let Xi, . . . ,X]^ be iid centered random variables, distributed like X 
with EX = 0. Assume that 

\X\ 

Eexp^ < 2. 

Then for all t > 0 we have

— ) ) , and (4.4) 



P [ YXi > t 1 < exp ( ^ ( 1 - 4A \ ) ) . (4.5) 

I 1 - I - v\ 4NEX 2 V iV(EX 2 ) 2 I I 



It would be interesting to adapt these results to Markov chains, with constants $1/\gamma$
times weaker for empirical averages of reversible chains, and $t_{mix}$ times weaker in general.
See Adamczak (2008) for a similar result.

3. (Moment inequalities) Boucheron et al. (2005) proves various moment inequalities for
functions of independent random variables using the "entropy method". It could be
interesting to generalize some of these to functions of Markov chains, with constants $t_{mix}$
(or $1/\gamma$) times weaker than in the independent case. We note that Chazottes and Redig
(2009) proves moment inequalities for some Markov processes.

4. (Berry-Esseen) Berry-Esseen theorems for empirical averages of discrete Markov chains 
were proven in Bolthausen (1980), Mann (1996) (for reversible chains, of order 

and Lezaud (1998b) (for discrete time and continuous time reversible Markov chains, 
with improved constants compared to Mann (1996), see also Lezaud (2001)). 



It would be interesting to get explicit Kolmogorov bounds of order $\sqrt{\frac{1}{\gamma N}}$ for empirical
averages of reversible chains, and $\sqrt{\frac{t_{mix}}{N}}$ for more general cases.

To convince the reader that this is indeed the correct order, we have the following 
proposition: 

Proposition 4.2. Let $X_1, \ldots, X_N$ be m-dependent, real valued, zero mean random
variables, with finite third moments, and let

\[ W = \sum_{i=1}^N X_i, \qquad \sigma^2 = \mathbb{E}(W^2). \]

Let $n = \lceil N/m \rceil$, and

\[ (Y_1, \ldots, Y_{n-1}, Y_n) := (X_1 + \ldots + X_m, \; X_{m+1} + \ldots + X_{2m}, \; \ldots, \; X_{(n-1)m+1} + \ldots + X_N). \]

Then we have



sup 



( — < z 



^ 9075 ^ E| ^ |3 (4J) 



i=l 



Remark 4.1. For m-dependent sequences, we typically have $\sigma \sim \sqrt{mN}$, $\sum_{i=1}^n \mathbb{E}|Y_i|^3 \sim
m^2 N$, thus the bound is of order $\sqrt{m/N}$. As far as we know, this is the first bound of this
order for m-dependent sequences.

Proof. This is a simple application of Theorem 7.4. of Barbour and Chen (2005) to 

Y]_, . . . , Y n . We can also use the same method to show bounds of order 11 M ■ \J~^- for 
d dimensional m-dependent random fields. □ 

5. (Moderate deviations) Moderate deviations can be seen as an extension of Berry-Esseen 
to a larger range. Following the notations of Section 11 of Chen, Goldstein and Shao 
(2011): 

Let $X_1, \ldots, X_n$ be i.i.d. centered random variables with variance 1, and $W := \frac{X_1 + \ldots + X_n}{\sqrt{n}}$.
Let $\Phi(z)$ denote the standard normal CDF, then

1 - $(z) x /n 



for $0 \le z \le n^{1/6} / (\mathbb{E}|X_1|^3)^{1/3}$.

This result is generalized to several dependence structures using Stein's method (see 
also Chen, Fang and Shao (2009)). 

It would be interesting to prove such a result for functions of Markov chains with
explicit constants, since in the range $z \le n^{1/6} / (\mathbb{E}|X_1|^3)^{1/3}$, it is stronger than the
concentration inequalities we got.
6. (DKW) The Dvoretzky-Kiefer-Wolfowitz inequality states the following (this version
was proven in Massart (1990)):

Theorem 4.2. Let $F_N$ denote the empirical distribution function for a sample of N
i.i.d. random variables with distribution function $F$. Then for every $\lambda > 0$,

\[ \mathbb{P}\left( \sqrt{N} \sup_x |F_N(x) - F(x)| > \lambda \right) \le 2 \exp(-2\lambda^2). \qquad (4.7) \]

This result means that the Kolmogorov distance of $F_N$ and $F$ is typically $O(1/\sqrt{N})$,
and allows the construction of confidence regions for $F$. It would be interesting to
show this result for Markov chains, with constants roughly $t_{mix}$ times weaker than in
the independent case ($1/\gamma$ times weaker for reversible chains). A similar result, for
geometrically ergodic Markov chains, was proven in Kontorovich and Weiss (2012).
For a simpler exposition of Massart's proof, see Dudley (2011) and Chapter 1 of Dudley
(2012).
7. (Bernstein-type DKW) It is clear that for any x sufficiently large, $F_N(x)$ has a small
variance, and thus by Bernstein's inequality, is closely concentrated around its expectation
(in a range much smaller than $1/\sqrt{N}$).

Let f(X) = jr^2 i<N fi(Xi) with \fi(Xi)\ < C, then Bernstein's inequality is of the 
form (d, C2 > 0, V = Y.<<n (/iW - V.W)') 

n\nx)-mx)\>t)<^( CiV -f* NCt ), («) 

which is equivalent to 

P I l/W - E/(X)| > ^^^f^'g ) < (4.9) 



Let (Xj)j<Ar be the random sample, with empirical distribution 

i<N 

Then it is natural to conjecture that for some ci, C2 > 0, 

c 2 CV + J(c 2 Ct) 2 + 4 Ci V(x)t/N r 
P ( |Fjv(x) - F(x)\ > 2N for any x G E < e~ T 

holds, with V(x) = NF(x)(l—F(x)) (which could be compared with V(x) = NF(x)(l — 
F(x)) using the original DKW inequality). This would allow sharper confidence regions, 
especially at the tails. 
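For the i.i.d. DKW inequality (4.7) of item 6, the confidence region can be made explicit: setting $2\exp(-2\lambda^2) = \delta$ and dividing $\lambda$ by $\sqrt{N}$ gives a uniform band of half-width $\sqrt{\log(2/\delta)/(2N)}$ around $F_N$. The Python sketch below computes this half-width; the function names are assumptions, and nothing here applies to Markov chains, for which the analogous bound is the open question stated above.

```python
import math


def dkw_half_width(sample_size, delta):
    """Half-width w with 2 * exp(-2 * N * w**2) = delta, from (4.7)."""
    if sample_size < 1 or not 0 < delta < 1:
        raise ValueError("need sample_size >= 1 and 0 < delta < 1")
    return math.sqrt(math.log(2 / delta) / (2 * sample_size))


def empirical_cdf(sample, x):
    """F_N(x): fraction of sample points that are <= x."""
    return sum(1 for value in sample if value <= x) / len(sample)


width = dkw_half_width(1000, 0.05)
```

The same width applies at every x, including the tails where $F_N(x)$ has very small variance, which is the inefficiency that the Bernstein-type conjecture of item 7 aims to remove.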

We think that coupling arguments and breaking the chain into blocks of length $t_{mix}$ (or $1/\gamma$)
could be useful. Most of these questions could be directly applied to the analysis of MCMC
simulations. We plan to keep up-to-date information about progress on these problems on
our webpage.

5. Proofs 

The maximal coupling of two random variables is the coupling that achieves the total variation
distance of their distributions (see Lindvall (1992), and Samson (2000) page 437):

Definition 8. Let $P$ and $Q$ be two measures defined on a common countable state space $\Omega$.
We define the maximal coupling of $P$ and $Q$, denoted by $\mu_{max}^{P,Q}$, as

\[ \mu_{max}^{P,Q}(x, y) = 1[x = y] \cdot \min(P(x), Q(y)) + \frac{(P(x) - Q(x))_+ (Q(y) - P(y))_+}{d_{TV}(P, Q)}. \qquad (5.1) \]

Remark 5.1. The coupling easily generalizes to the non-discrete case. One can see that

\[ d_{TV}(P, Q) = \sum_x (P(x) - Q(x))_+ = \sum_y (Q(y) - P(y))_+ = 1 - \sum_x \min(P(x), Q(x)). \qquad (5.2) \]

Also note that

\[ (P(x) - Q(x))_+ (Q(y) - P(y))_+ = (P(x) - Q(x))_+ (Q(y) - P(y))_+ 1[x \ne y]. \qquad (5.3) \]
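The maximal coupling can be computed directly for finite distributions. The Python sketch below assumes that the second term of (5.1) is normalised by $d_{TV}(P, Q)$, which is the standard form; the function names and the example distributions are hypothetical. Exact rational arithmetic is used so that the marginals and the mismatch probability can be compared without rounding.

```python
from fractions import Fraction


def total_variation(p, q):
    """d_TV(P, Q) = sum_x (P(x) - Q(x))_+ for distributions given as dicts."""
    return sum(max(p.get(x, 0) - q.get(x, 0), 0) for x in set(p) | set(q))


def maximal_coupling(p, q):
    """Return mu[(x, y)] following (5.1); p and q are dicts of probabilities."""
    states = sorted(set(p) | set(q))
    tv = total_variation(p, q)
    mu = {}
    for x in states:
        for y in states:
            # Diagonal part: the mass that both distributions share at x.
            mass = min(p.get(x, 0), q.get(x, 0)) if x == y else 0
            if tv > 0:
                # Off-diagonal part: excess of P at x paired with excess of Q at y.
                excess_p = max(p.get(x, 0) - q.get(x, 0), 0)
                excess_q = max(q.get(y, 0) - p.get(y, 0), 0)
                mass += excess_p * excess_q / tv
            mu[(x, y)] = mass
    return mu


# Hypothetical example with exact rational arithmetic.
P = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
Q = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}
mu = maximal_coupling(P, Q)
mismatch = sum(m for (x, y), m in mu.items() if x != y)
```

In this example the two coordinates differ with probability exactly $d_{TV}(P, Q) = 1/2$, and summing `mu` over either coordinate recovers P and Q.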
The following lemma will be useful in the proofs: 

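Lemma 5.1. Let $X$, $\hat{X}$, $\mathcal{M}$ and $\Gamma$ be as in Theorem 2.1. Let $Y \in \Lambda$, $Y \sim Q$, let $\hat{Y}$ be a
partition of $Y$, similarly to $\hat{X}$ (i.e. $\mathcal{I}(\hat{Y}_i) = \mathcal{I}(\hat{X}_i)$ for $i \le n$), and $\hat{Y} \sim \hat{Q}$. Then

\[ D(\hat{P} \| \hat{Q}) = D(P \| Q). \qquad (5.4) \]

Let $C(c)$ be as in (2.5), then

\[ d_c(Q, P) \le d_{C(c)}(\hat{Q}, \hat{P}). \qquad (5.5) \]

Finally, in the case of $d_2$ distance,

\[ d_2(Q, P) \le \sqrt{s(\hat{X})} \cdot d_2(\hat{Q}, \hat{P}). \qquad (5.6) \]

Proof. (5.4) follows from $\hat{P}(\hat{x}) = P(x)$ and $\hat{Q}(\hat{x}) = Q(x)$.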



E 



7r(X~P,Y~Q) 



i=i 



E 



7r(X~P,y~Q) 



J^f max l[X^Y 3 ]).a(c) 



,i=i 

N 



The infimum of the left hand side is greater or equal to the infimum of the right hand side, 
so (5.5) follows. 

For the d 2 distance, for a coupling n(X ~ P, Y ~ Q), denote the "corresponding coupling" 
of X, Y by tt (^X ~ P, Y ~ Q j (i.e. ir(x, y) = fr(x, y)), and denote the coupling of these 4 
variables by II. 



1/2 



,y&A i=1 

n 2 



1/2 



1/2 



> 



> 



EE E 

yak 4=1 jex,(x) 



s(X) 



Yi = iji 



■Q(y) 

1/2 
so taking infimum, (5.6) follows. □

The following coupling construction is a generalization of the construction that has ap- 
peared in Samson (2000), Proof of Theorem 1. 

Before we start, let us review a simple fact about conditional independence: for 3 discrete
random variables $X, Y, Z$, we say that $X$ is conditionally independent of $Y$ given $Z$ if

\[ \mathbb{P}(X = x, Y = y, Z = z) = \mathbb{P}(X = x | Z = z) \, \mathbb{P}(Y = y | Z = z) \, \mathbb{P}(Z = z) = \frac{\mathbb{P}(X = x, Z = z) \, \mathbb{P}(Y = y, Z = z)}{\mathbb{P}(Z = z)}. \qquad (5.7) \]

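Construction 5.1. Given two measures $P$ and $Q$ defined on state space $\Lambda$, we are going
to define a coupling $\Pi^{P,Q}$ of random variables $Y, X^{(1)}, \ldots, X^{(N)}$, taking values in
$\Lambda \times \Lambda \times \Lambda_{\ge 2} \times \ldots \times \Lambda_{\ge N}$, with $X^{(1)} \sim P$ and $Y \sim Q$.

Step 1.1. Let $(X_1^{(1)}, Y_1) \sim \mu_{max}^{P_1, Q_1}$, i.e. the maximal coupling of $P_1$ and $Q_1$.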
Step 1.2. Given {X^\Yx), we define 



(I?, • • - ,X$\X?\Yi) ~ PU-\X?) and {xf\ . . . ~ P^Y,) 



as 



(xi^...,X%\x£\...,xV\Y 1 ,X^)~ M \.\Y 1 ,xV). 

In the following, we do similar steps iteratively. Assume that we have already defined 
X^\ . . . , and Y<j_i for some 1 < i < N, and that this satisfies 

(xWiy^o-PgM^-i). 

Then we do the following steps: 

Step i.l. In this step we add 
We want 

[Xf , Y t \Y^ x ) ~ f J l ^ i - l),Qimi - l) (5.8) 

and we want Yi to be independent o/XW,...,X' ,_1 ',X"L given Y<i_ u Xf\ From 
(5.7) one can see that both of these are satisfied by the definition 

n p >« (y< h (5.9) 

.Step z.S. iVc>u> we introduce X^' l+l ^ (we skip this step for i = N). We want 

Y#,X®), (5.10) 



x^\x^ x 



Y <u xf ] ) ~Af 



and we want X^ l+1 ' to be independent of X^\ . . . ,X^ 1 ^ given (X^\ Y<i). Both con- 
ditions are achieved by 

IT^ (y^,x^,...,x^) 

n p <? (v< t ,x0), . . • M* (s^VS?^ y< u xf) 



pN ( „n(i) 
>i+l I X >«+1 



y<i-i,a?i 
Iterating these steps up to i = n completes the definition of $\Pi^{P,Q}$.

We continue with the well-known tensorization property of the relative entropy:

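Lemma 5.2 (Lemma 1 of Samson (2000)). Let $P$ and $Q$ be two probability measures defined
on $\Lambda$, $E_1 := D(Q_1 \| P_1)$,

\[ E_i := \sum_{y_{\le i-1} \in \Lambda_{\le i-1}} D\left( Q_i(\cdot | y_{\le i-1}) \,\|\, P_i(\cdot | y_{\le i-1}) \right) \cdot Q_{\le i-1}(y_{\le i-1}) \qquad (5.11) \]

for $2 \le i \le N$, then

\[ D(Q \| P) = \sum_{i \le N} E_i. \qquad (5.12) \]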
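The tensorization identity can be verified numerically in the smallest non-trivial case N = 2. The Python sketch below computes $E_1$ and $E_2$ for two distributions on $\{0,1\}^2$ and compares their sum with $D(Q \| P)$; the helper names and the two example distributions are assumptions, and the sketch assumes P is positive wherever Q is.

```python
import math


def kl(q, p):
    """Relative entropy D(q || p) for dicts over the same finite set."""
    return sum(qv * math.log(qv / p[key]) for key, qv in q.items() if qv > 0)


def first_marginal(joint):
    """Marginal of the first coordinate of a distribution on pairs."""
    out = {}
    for (y1, _), mass in joint.items():
        out[y1] = out.get(y1, 0.0) + mass
    return out


def conditional_second(joint, y1):
    """Conditional law of the second coordinate given the first equals y1."""
    total = sum(mass for (a, _), mass in joint.items() if a == y1)
    return {b: mass / total for (a, b), mass in joint.items() if a == y1}


def tensorization_terms(q, p):
    """Return (E_1, E_2) of Lemma 5.2 for N = 2."""
    q1, p1 = first_marginal(q), first_marginal(p)
    e1 = kl(q1, p1)
    e2 = sum(
        q1[y1] * kl(conditional_second(q, y1), conditional_second(p, y1))
        for y1 in q1 if q1[y1] > 0
    )
    return e1, e2


# Hypothetical distributions on {0, 1}^2.
Q = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
P = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
e1, e2 = tensorization_terms(Q, P)
```

Here the first marginals of Q and P coincide, so $E_1 = 0$ and the whole relative entropy is carried by the conditional term $E_2$.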



Finally, we will need a property of the d 2 distance in 1 dimensions: 

Lemma 5.3 (Lemma 2 of Samson (2000)). Let Q be a discrete state space, and R,W two 
measures on Q. Define 

9 -i 1/2 

Rix) 1 



d v {R\W) :-- 



.xefi 



1 - 



W(x) 



then 

and thus 
and 



W(x) 

d v {R\Wf + d v {W\Rf < 2D(R\\W) 
d v (R\W) 2 < (2D(R\ \W)) l/ \ 



(5.13) 
(5.14) 



d v {W\R) 2 < {2D{R\\W)) 1/2 . 

Now we are ready to start our proofs: 

Proof of Theorem 2.1. First, let us suppose that X = X and thus s(X) = 1 and N = n. Let 
n := n p - Q , (Y,XW, . . ~ n (see Construction 5.1), and let X := Then 



d c (Q, P) < E n p,< 



Y^CillX^Yt] 



i<n 



First let us deal with the i — 1 term. From step 1.1 in the definition of n p,( ^, and Pinsker's 
inequality, 



E n p,g [CilfXi ^ Fx]] < CxdrviP^Q,) < G x \\-E x . 



For i = 2, we can write 



1[X 2 ^ Y 2 \ < 1 X 2 ± X, 



-(2) 



Similarly as before, 



E u [l[X 2 ^Y 2 ]\Y 1 )=d TV (P 2 (-\Y 1 ) 1 Q 2 {.\Y 1 )) 
so by concavity of the square root,



E n (l[X 2 ^Y 2 ]) < \j-E 2 . 



By step 1.2, we can see that (l (2) ,I>? *i,-Xi) ~ M 1 {-\Y 1 ,Xi), thus 



E 



n p ><* 



X 2 ? X® 



Y 1 ,X 1 



E^i [1[X 2 ^ X' 2 }\ Y 1 ,X 1 ] < T 1>2 1[X 1 ^ Yi[ 



From our result for i — 1, we get 



E n p,« 



1 



X 2 ^X, 



(2) 



Similarly, for i = k, we can write 

E n 
E n 



(2) 
A- 



<ri,2M/fi. 



k I k 



Xjp + Y k 



^ Y k 



J 2 Ek 



1 



U) i yO'+l) 



< Er 



E 



1 [** ^ X' k ] 



< r„ fe E n 



1 



Yj ^ Xf < T m l -E 3 for 1 < j < k < n. 



By summing up in i, we get 



d c (Q,P) < E n 



i<n 



30 



^ E^E r ^v^ = E c ^\ E i ^ w - c\\^\d{q\\p). 

i<n j<i i,j<n 

The general case, (2.4) follows by Lemma 5.1. □ 

Proof of Corollary 2.1. This follows from Theorem 2.1 by Proposition 6.1. of Ledoux (2001). 
See also Problem 12.3 of Dubhashi and Panconesi (2009). □ 

Second proof of Corollary 2.1. Here, we are going to give a direct proof of this result based 
on the martingale approach of Chazottes et al. (2007) (a similar proof is probably possible 
using the method of Kontorovich (2007)). 

We will write f(X) := f(X), then as a function of X, it is C(c) weighted Hamming 
Lipschitz (for x,y e A, f(x) - f(y) < d C ( c )(x,y)). 

Let us define T\ = o~(X\, . . . , X{) for i < n, and write 



n 

f(X)-Ef(X) = f(X)-Ef(X) = 

i=i 



with 
Vi(X) := E(/(X)|7i) - E(f(X)\^)

= /2 ^>i+l(^>t+ll^<*) ' f(X<i, 2>i + i) 

Z >i + 1 

= ^2 P >i+l( Z >i+l\X<i) ■ f{X<i,Z> i+1 ) 

Z >i + 1 

~'5~]Pi{zi\X<i-.i) P> i+1 (z> i+1 |X<i_i, Zi) ■ f(X<i-i, z>j) 
24 z >i+i 



< sup ^ ^i(^>H-il^<i-i.a)/(^<i-i.a^>H-i) 
inf £ % +1 (^ i+1 |^<i-i, 6, 



>i+l 



=:M,(X)-m,(X), 

here M,(X) is the supremum, and mj(X) is the infimum, and we assume that these values 
are taken at a and b, respectively (one can take the limit in the following arguments if they 
do not exist). 

After this point, Chazottes et al. (2007) defines a coupling, 



7T 



(Zg^ ~ P>Vi(-|^-i ; «)^>Si ~ P>i+i(-\X<i-i,b) 



as the maximal coupling between these two distributions. Although this coupling minimizes

(2) 1 
i+lJ 



expectation of ^ ^>i+i] ; it is not always the best choice. 



We define 

* ~ PZi+A^i-u^Z^ ~ P>" + i("l^-i^)) (5-15) 

:= (zgft ~ ^(.iVLa),^ ~ P>Vi (-l^-i, &)) • 
From this coupling, one can see that 

M t (y) - m t (Y) = E n (/(*<<_!, Z^p) - /(*<*_!, Z n ^ ] )\X^_^ 



n 
j=i 



X i , . . . , X, 



The following result was proven in Devroye and Lugosi (2001): 
Lemma 5.4. Suppose $\mathcal{F}$ is a sigma-field and $Z_1, Z_2, V$ are random variables such that

1. $Z_1 \le V \le Z_2$

2. $\mathbb{E}(V | \mathcal{F}) = 0$

3. $Z_1$ and $Z_2$ are $\mathcal{F}$-measurable.
Then for all $\lambda \in \mathbb{R}$, we have

\[ \mathbb{E}(e^{\lambda V} | \mathcal{F}) \le e^{\lambda^2 (Z_2 - Z_1)^2 / 8}. \qquad (5.16) \]

Now using Lemma 5.4 with V = V h Z x = m^X) - E(/(X)| Ti-i), Z 2 = M^X) - 
E(/(X)|J r j_ 1 ), and T = J~i-i, we get that 



Fi-i) < exp 



A 2 



By taking the product of these, we get 



E e A/ W <exp -||r-C( C )|| 



A 2 



(5.17) 

and the tail bound follows by Markov's inequality. □ 
Proof of Corollary 2.2. The main idea: we divide the index set into mixing time sized parts. 



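We define the following partition of $X$: let $n = \left\lceil \frac{N}{\tau(\epsilon)} \right\rceil$ and

\[ \hat{X} := (\hat{X}_1, \ldots, \hat{X}_n) := \left( (X_1, \ldots, X_{\tau(\epsilon)}), (X_{\tau(\epsilon)+1}, \ldots, X_{2\tau(\epsilon)}), \ldots, (X_{(n-1)\tau(\epsilon)+1}, \ldots, X_N) \right). \]

Such a construction has the following important property: $\hat{X}_1, \ldots, \hat{X}_n$ is now a Markov
chain, with $\epsilon$ mixing time $\hat{\tau}(\epsilon) = 2$ (the proof of this is left to the reader as an exercise).
Now we are going to define a Marton coupling $\mathcal{M}$ for $\hat{X}$, i.e. for $i \le n$, we need to define

\[ \mathcal{M}^i \left[ \hat{X}_{\ge i+1} \sim \hat{P}_{\ge i+1}(\cdot \,|\, \hat{x}_{\le i}), \; \hat{X}'_{\ge i+1} \sim \hat{P}_{\ge i+1}(\cdot \,|\, \hat{x}_{\le i-1}, \hat{x}'_i) \right]. \]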



First step: we define (Xi +2 , X'i +2 ) as the maximal coupling of 

Pi+2( m \x<i), Pi +2 (-\x<i-i, x'j), 

then we have 



M^X^^X'i^x^x'i) <e. 



mUx^x* Xi +2 , X' i+2 



Second step: we define 



as the maximal coupling of P i+ i(-\x<i, X i+2 ), P i+1 (-|x<j_i, x' ir X' i+2 ). Then trivially 

M\X l+1 ^X' l+1 \x< u x[)<\. 
Third step: let M l [X i+4 ,X' X i+ i, X' i+1 , X i+2 , X' i+ 2j be defined as the maxi-

mal coupling of 

P i+ 4(-\x<i, X i+ i, X i+2 ), Pi+i(-\x<i-i, x-, X' i+l , X i+2 ). 
By the Markov property, we have 

Pi + ±(-\x<i,Xi + i,X i+2 ) = P i+ i(-\Xi +2 ), and (5.18) 
Pi+i(-\x<i-i : x i: X^ +1 ,X i+2 ) = P i+4 (-|X- +2 ), (5.19) 

therefore it is easy to see that 

M^X+^X'^lx^xD^e 2 . 

Fourth step: we define 

i+3 \%<ii ■F'ii 

Xi+i, X'i + i, X i+2 , X' i+2 , X i+ 4 ^ X' i+ 4 

as the maximal coupling of 

Pi+z{-\x<u X i+ i,X i+2} Xj +4 ), P i+3 (-|x<i_i, x-, X i+1: X i+2 , X' i+4 ). 
It is a simple exercise to show that 

M\X i+3 ^X' i+3 \x< h 

We get M l by iterating the third and fourth steps (we can iterate them infinitely, so it is 
not a problem if n — i is odd). From the construction, it is clear that 



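\[ \Gamma = (\Gamma_{i,j})_{i,j \le n} \le \begin{pmatrix} 1 & 1 & \epsilon & \epsilon & \epsilon^2 & \epsilon^2 & \cdots \\ 0 & 1 & 1 & \epsilon & \epsilon & \epsilon^2 & \cdots \\ \vdots & & \ddots & & & & \vdots \\ 0 & \cdots & & & & 0 & 1 \end{pmatrix}, \qquad (5.20) \]

with the inequality meant in each element of the matrix.

Now, by the simple fact that $\|\Gamma\| \le \sqrt{\|\Gamma\|_1 \|\Gamma\|_\infty}$, we have $\|\Gamma\| \le \frac{2}{1 - \epsilon}$, so applying Corollary
2.1 and taking infimum in $\epsilon$ proves the result. □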

Proof of Corollary 2.3. This is an immediate consequence of Corollary 2.1. □ 

Proof of Corollary 2.4. This follows with a 4 times worse constant from Corollary 2.2. We get
this better constant by applying a trick from the proof of Theorem 1 of Janson (2004). 
Without loss of generality, let us assume that N is divisible by r(e), then we can write 

N r(e) N/r(e) r(e) 

i=l j = l i=l i = i 



i.e. we group X> t into r(e) parts. With this notation, it is clear that S = Y^j=l & 
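Now for some $j \le \tau(\epsilon)$, $X^j := (X_j, X_{\tau(\epsilon)+j}, \ldots, X_{N - \tau(\epsilon) + j})$ forms a Markov chain, with $\epsilon$
mixing time $\tau^j(\epsilon) = 1$.

Now for such a chain, we can easily see that the Marton coupling $\mathcal{M}$ for $X^j$ can be
constructed by defining $\mathcal{M}^i$ as the maximal coupling, and that we have

\[ \Gamma = (\Gamma_{i,j})_{i,j \le N/\tau(\epsilon)} \le \begin{pmatrix} 1 & \epsilon & \epsilon^2 & \epsilon^3 & \cdots \\ 0 & 1 & \epsilon & \epsilon^2 & \cdots \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & & 0 & 1 \end{pmatrix}. \qquad (5.21) \]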



Define c := (b± — ai, 



in 



a N ), and Cj = (bj &Ar_ T ( e)+i - a N _ r(t)+j ), for 1 < 



J < r ( e )- Then by Corollary 2.1, we have for every AgR, 

E(e^) < ex P ^||r •C,f) 



< exp 

\Ci 



A 2 1 



la- 



; (1 -e^ 2 

Denote, for 1 < j < r(e), pj := j2^]\c \\ ' Then, by Jensen's inequality, we can write 



(5.22) 
(5.23) 



E(e AS ) =E(e A ^^) 



(5.24) 



E e 



r(e) 



< 



< 



X 2 



exp 



cxp 



8 (1-e) 2 
A 2 1 



(1-6) = 



3=1 

r(e) 

£113 

r(e) 
3=1 



e p j 



exp 



A 2 1 



'1-e 



:T e c 



From this, by Markov inequality we can deduce that 

-2t 2 (l -e) 2 



P(S - ES > t) < exp 



c t e 



(5.25) 



the same bound holds for the lower tail. Finally, to show (2.11), we need to rescale S by 
N - t , and show that \EZ - E^Z| < (fc ~ a ^ o " (fo) , these are left to the reader. □ 

Proof of Theorem 2.2. Again, let us suppose that X = X and thus s(X) = 1 and N = n. 

Let (Y,X^\ . . . ~ n p,< 3 (see Construction 5.1). For simplicity, in the following, we 

will write II := U P 'Q. Then we have 



d 2 (P,Q) < sup 

*^(Eiof(y))< 



d 2 (Q,P) < 



1 \i<n / 

sup E n V^XjlllfV^l 

: 4 /9?(A:))<l \i<n / 



(5.26) 
(5.27) 
For a fixed a : A — > R, let us denote Aj := Eg(o;?(X)) (more precisely Aj(a)) and similarly
A i: =Ep(/32(X)). 

Then we can assume that ^ i<n Aj < 1 and J2i< n ^-i — 1; since we only take supremum 
for such a and (3. 

Let us fix some i < n, and bound the corresponding term in (5.26): 

E n (a 4 (F)l[X l (1 V^]) < (5.28) 

E n (oi(Y) (l[Y ? X«] + 1[X 4 (1) ^ XP] + ... + t[xt 1] * X®} 

=: A l + f^B? 

3=1 



(0 



E ' n 



>i+l 



z?\y<i ) • n (ar^j/j 



y<i-i) ■ n(j/<i_i)l j/i 7^ x. 



.(0 



/ 



E n (y<i_i) E I E a ' 1 ^ ' n 1^* 
v* W«+i 



y<i-i 



(i) 



a) 



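Let us evaluate the last term. By the definition of $Y_i$ in $\Pi^{P,Q}$, we have $(X_i^{(i)}, Y_i \mid Y_{\le i-1}) \sim \pi_{\max}\left(P_i(\cdot \mid Y_{\le i-1}), Q_i(\cdot \mid Y_{\le i-1})\right)$, thus by the definition of the maximal coupling, we get

\begin{align*}
\sum_{x_i^{(i)}} \Pi(x_i^{(i)}, y_i \mid y_{\le i-1})\, \mathbf{1}[y_i \ne x_i^{(i)}]
&= \sum_{x_i^{(i)}} \frac{[Q_i(y_i \mid y_{\le i-1}) - P_i(y_i \mid y_{\le i-1})]_+ \, [P_i(x_i^{(i)} \mid y_{\le i-1}) - Q_i(x_i^{(i)} \mid y_{\le i-1})]_+}{d_{TV}\left(Q_i(\cdot \mid y_{\le i-1}), P_i(\cdot \mid y_{\le i-1})\right)} \\
&= [Q_i(y_i \mid y_{\le i-1}) - P_i(y_i \mid y_{\le i-1})]_+ ,
\end{align*}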

substituting back gives 



A = E n (^<*-i) 'EE ) ' 11 
y<<-i w \i/> i+ i 

• [Qi(?/i|y<i-i) - Pi(vi\v<i-i)]+ 



E q(j/<*-0 E E a ^ y "> ' Q ( y ^ i+i 

y<i-i Vi \f>i+i 

, Pi(yi\y<i-i) 



Qi(yi\y< 



i-lj 



Qi(yi\y< 



i-l 
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< 



<i-l. 



y<i-i 



n 1/2 



1 V2 



< E, 



[Eo[a?(y)|y<i_i]] 



1/2 



1 p t (y t |y<,_!) 



1/2 



< 



[E Q k 2 (F)]] 1/2 [Eg [2D(QMY< % -MPMy^M' 2 <{^ l/2 mYI\ 



In the last steps we have used Lemma 5.3.
Similarly, for $j < i$,



j/,a;(j) ,sb(j'+1) 

= E n ^^) E 1 •//,//-, J E 



y<i-i 



>j+i' 



£ « i (y)-n(^ +1 |y< J ,x^,x^) 
•Il(.r^. | ..r '- n .ry : .^ ; 



• 1 



E n fe-o E n ( 



C?) 

2Cj ,2/j 
Applying Cauchy-Schwarz to the last sum gives



E E n (ai(Y)\y<j, x^\x^ +v> ) ■ l[x? +1) ^ a^II ( 



>j+i' 



< 



/ 

I \ 1/2 



1/2 



^>/+i^ (i+i) 



< (E n (a,(F) 2 | if,^))^ • (r^l[xf ^ 

Now we will need the following lemma:
Lemma 5.5. For any $1 \le j \le n-1$, we have

\[ \Pi(y_{\ge j+1} \mid y_{\le j}, x_j^{(j)}) = \Pi(y_{\ge j+1} \mid y_{\le j}). \]
Proof. First, we want to show that $\Pi(y_{j+1} \mid y_{\le j}, x_j^{(j)}) = \Pi(y_{j+1} \mid y_{\le j})$:



1/2 



Now by step (j + l).l, we have 



n (%+i | y<j , sj+i 1 ' ) = n (j/j+i I y<j , x \ J +i' , x ) 



and by step j.2, we have 



thus 



,) = E n^+il^-,^,^) • n^l^i,^) = Kiy^y^xf ). 



E i+i 



The next step is to show that Tl{yj + 2\y<j+i) = H(yj + 2\y<j+i,x^): 

n (%-+2|y<i+i) = E u (yj+^y<j+^ x( j+2 2) ) ■ n ( x ?+ + 2 2) b<j+i) 

r U+2) 
J+2 

= E ^ 1+ 2\y< J+ i,x^,xf)-U(x^\y< J+1 ,x^^\xf 



r O+2) 
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= £ n(y j+2 \y^ +1 ^ ^ • 



x j+2 i x j ) ■ Pj+z( x j+2 \y<j+i) 



.0+2) 
l j+2 



r 0+2) 
j+2 



n (%-+2|y<i+i,4 J) ' 



Using this lemma, we can now write 



y<j-i 



U) 



■ (E n (c^F) 2 ! 
By step j.l, we can write 



1/2 



[Qj(yj\y<j-i) - p j(yj\y<j-i))+ l 'Mf //• j i) - QMf //• , 



.-fj (■ |y<j - 1 ) :Qj (■ |y< i - 1 ) 



<W (<5j(-b<y-i) - Pj(-\y<j-i)) 

thus summing up in Xj, we get 



y< 3 -i 



E n fe-0 E 



y< 3 -i 



1 _ p j(yj\y<j-i) 
Qj(yj\y<j-h 



(E n (a J (F) 2 |y< J )) 1/2 g j ( % b< J 



< E n fe-i) ( E n («,(F) 2 | ^-i)) 172 ■ I E 



y<j-i 



\ _ p j(yj\y<j-i) 
Qj(yj\y<j-i) 



Qj(yj\y<j-i) 



< 7j , i (A l ) 1 / 2 (2^) 1/2 . 
Summing up (5.28) in i gives that 



e(^+ E b ?) < E ((a,) 1/2 (2^) i/2 + E uA^m)^ 

i<n \ j <i — 1 / i<n \ j <i — 1 / 



= E T^(A i ) 1/2 (2^-) 1/2 < hllv^Qp), 

i,j<n 

and since this holds uniformly for all $\alpha$ satisfying

\[ \mathbb{E}_Q\left(\sum_{i \le n} \alpha_i^2(Y)\right) = \sum_{i \le n} A_i \le 1, \]

(2.14) is proven.

The proof of (2.15) is analogous: we start from (5.27) and continue along the same lines, with $\beta$ replacing $\alpha$. Equation (5.28) becomes



E n (AP0l[*fV^]) < 

3=1 

We start with the first term: 



+ i[xt 1] +xf~ 



(5.29) 



= Y ^ x(1) ) ■ ^ • n ( x(1) > • • • *<<) 

• n(x^ , . . . , X i x >i+\\ x i' )V<i) 



A(x (1) ) • ife/i ^ xf] ■ n(^ (1) 



x( 1 ),...,x( I ),y< i 

•nCx^jj/il^i-i) • n(y<i_i) 

Now, from the definition of II, we can see that 

U(x^,...,x^ 1 \x%\xt\y< i ) = U(xW 

thus 



(i— 1) n(i) I (i) > 
, . . . , X )- c >i+ll a 'j iV<i— 1; 



4 = ^ n( 2/ < i _ 1 ) ^ 



V<i-i 



l^X?]-^,^^) 



(0 



^ A(x (1) ) ■ n(x (1) , 



fi— I") n(i) I (i) <, 



iW,..,!^, 



We have 



(i) 



[Qi(y»|y<i-i) - Pi(yi\y<i-i)}+ Pi{Xi ] \y<i-i) - <5i(^f ; |?/<i-i) 



.Mi 



<W (Qi("|j/<*-l) - J°i(-|j/<i-l)) 



summing up in $y_i$, and then applying Cauchy-Schwarz gives



At = e n fe-i) E 



y<i-i 



P^y^-Q^y^) 



(i) 



n(i) 



>i+l 



Wi 



E n (^-)E 



!/<i-i 



id 



E ■ n(*« . . .,x^ 1 \x n ^ +1 \xf ) ,y^ 1 ) 



' ' >z + l 



^? |j/<i-i) 



^(4°|y<i 



< e [Kn(^ W )W] 1/S 

y<»-i 

■n(y< w ) < (2^) 1 / 2 (A t ) 1/2 . 
Finally, for j < i, 



i 1/2 



Pi(arS°|y<* 



^0*1° |j/<i-i) 



-CO / 



(7+1) 



> S/<i) 



E • Ifx? 5 ^ • n (x« . . . ,x^\x^\x^ +1 \y< 3 ) 



•n (.r":/:,..^-^,;^.//. ; ) • n (,-y .^^ ; ; ) ■ win. , ,) 

= E n ^-i) E n {*?>vi\v<i-i) E ^ ^ 



.r 



(i+i)i 



V<j-l 



>j+i' 



■n(xgj. 1 ,x^' +1 )|xf • E /9<(a; (1) )7r(x«,...,x^- 1 )|x^,x^ +1 ),j/< i ) 



sC 1 ),...,^*" 1 ) 



E n (^-i) E 11 (•'■./'• //, //', «) E ^ ^ 



U) j. x 0'+i)i 



>j+i' 



< E n fe-i) E 11 (''/•//,//', ') (En(A 2 (X«)|xf 



1/2 



J/<i-i 



0') _Z y(j+l)l 



GO 

x j >V<j 



1/2 
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1/2 



y<i-i 



E n mX^xf^ 



Here 



E n (fi{XW)\xf ,y<j) = E n [0t{X^)\xf , y^-i 
as we can see from (5.9), therefore 



!/<3-l 



E n (.r/ ., /; ,/., ,) l[sf ^ (E n (^(XW)!^,^! 



1/2 



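As previously, we can write

\[ \mathbf{1}[x_j^{(j)} \ne y_j]\, \Pi(x_j^{(j)}, y_j \mid y_{\le j-1}) = \frac{[Q_j(y_j \mid y_{\le j-1}) - P_j(y_j \mid y_{\le j-1})]_+ \, [P_j(x_j^{(j)} \mid y_{\le j-1}) - Q_j(x_j^{(j)} \mid y_{\le j-1})]_+}{d_{TV}\left(Q_j(\cdot \mid y_{\le j-1}), P_j(\cdot \mid y_{\le j-1})\right)}, \]

summing up in $y_j$ we get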



»<J-1 



Si) 



■(En (^(^^Ixf,^-^)) 



1/2 



< n,i E nfe-O E 



1/2 



< 7i,< E n fe-0 ( E n W(* (1) )b<j-i)) 1/2 
y<i-i 

/ r .T \ 1/2 



V 



„(i) 



1 - 



QMf y-j ■ ) 



^(*?Wi) 



/ 



< 7j ,(A l ) 1 / 2 (2E i ) 1/2 . 



Summing up in i, j gives (2.15).

The general case, when $\hat{X} \ne X$ and thus s(X) > 1, follows by Lemma 5.1.

Proof of Corollary 2.5. The following proof is based on page 128 of Ledoux (2001) (see also Dubhashi and Panconesi (2009), Section 13.5). First, by the triangle inequality for the $d_2$ distance (see Lemma A and B of Marton (2003)), we have for any distributions Q, R on $\Lambda$, $d_2(Q,R) \le d_2(Q,P) + d_2(P,R)$. Thus, by (2.14) and (2.15), we get

\[ \frac{1}{4\|\gamma\|^2 s(X)}\, d_2(Q,R)^2 \le D(R\|P) + D(Q\|P). \tag{5.30} \]

Now take any $A \subset \Lambda$ with $P(A) > 0$. Let $Q(x) := P(x)/P(A)$ for $x \in A$ and 0 otherwise, then

\[ D(Q\|P) = \log\left(\frac{1}{P(A)}\right). \tag{5.31} \]

By the definition of Talagrand's convex distance and the $d_2$ distance, one can see that for any distribution R on $\Lambda$, and any Q supported on A,

\[ \mathbb{E}_{Y \sim R}\left(d_T^2(Y,A)\right) \le d_2(Q,R)^2. \tag{5.32} \]

Let

\[ Z := \mathbb{E}_P \exp\left(\frac{d_T^2(X,A)}{4\|\gamma\|^2 s(X)}\right) \]

and

\[ R(x) := \frac{1}{Z} \exp\left(\frac{d_T^2(x,A)}{4\|\gamma\|^2 s(X)}\right) P(x). \]

Using (5.32), we get

\[ D(R\|P) = \frac{1}{4\|\gamma\|^2 s(X)}\, \mathbb{E}_{Y \sim R}\left(d_T^2(Y,A)\right) - \log Z \le \frac{1}{4\|\gamma\|^2 s(X)}\, d_2(Q,R)^2 - \log Z. \]

Comparing this with (5.30) and (5.31), we get $\log Z \le \log\left(\frac{1}{P(A)}\right)$, and thus (2.17). □
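
The identity (5.31) can be checked numerically on a small example. The following is only a sketch: the four-point distribution `P`, the event `A`, and the helper names `kl_divergence` and `condition_on` are hypothetical choices made for illustration, not objects from the paper.

```python
import math

def kl_divergence(q, p):
    """D(Q||P) = sum_x Q(x) * log(Q(x) / P(x)), with the convention 0 * log 0 = 0."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0)

def condition_on(p, event):
    """Q(x) = P(x) / P(A) for x in A, and 0 otherwise."""
    mass = sum(p[x] for x in event)
    return {x: (px / mass if x in event else 0.0) for x, px in p.items()}

# Hypothetical toy example: a distribution on four points and an event A.
P = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
A = {"b", "d"}
Q = condition_on(P, A)

lhs = kl_divergence(Q, P)                    # D(Q||P)
rhs = math.log(1.0 / sum(P[x] for x in A))   # log(1 / P(A))
print(lhs, rhs)  # the two values should agree up to rounding
```

Since Q is proportional to P on A, the log-ratio inside the divergence is constant and equal to log(1/P(A)), which is why the two printed numbers coincide.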

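Proof of Theorem 2.3. One could prove this result, with constants 4 times worse, by Theorem 2.6. The following proof is similar to the proof of Theorem 2 of Samson (2000).
First, suppose that $\hat{X} = X$, then s(X) = 1, and N = n. We have

\[ f(y) - f(x) \le \sum_{i=1}^{N} \alpha_i(y)\, \mathbf{1}[x_i \ne y_i]. \]

As previously, denote the law of X by P, and let Y be a random variable taking values in $\Lambda$ with law Q. Let $\pi[X \sim P, Y \sim Q]$ be a coupling of P and Q. With the shorthand notation $\mathbb{E}_P f := \mathbb{E}_P f(X)$ and $\mathbb{E}_Q f := \mathbb{E}_Q f(Y)$, we have

\[ \mathbb{E}_Q f - \mathbb{E}_P f = \mathbb{E}_\pi [f(Y) - f(X)] \le \sum_{x,y \in \Lambda} \sum_{i=1}^{N} \alpha_i(y)\, \mathbf{1}[x_i \ne y_i]\, \pi(x,y). \]

Now using the second definition of $d_2(P,Q)$, (2.13), we can see that

\[ \mathbb{E}_Q f - \mathbb{E}_P f \le \left[\mathbb{E}_Q \sum_{i=1}^{N} \alpha_i^2(Y)\right]^{1/2} \cdot d_2(P,Q), \text{ and similarly,} \]
\[ \mathbb{E}_P f - \mathbb{E}_Q f \le \left[\mathbb{E}_P \sum_{i=1}^{N} \alpha_i^2(X)\right]^{1/2} \cdot d_2(Q,P). \]

By Theorem 2.2, $d_2(P,Q), d_2(Q,P) \le \|\gamma\| \sqrt{2 D(Q\|P)}$. Denote $\alpha^2(x) := \sum_{i=1}^{N} \alpha_i^2(x)$, then

\[ \mathbb{E}_Q f - \mathbb{E}_P f \le [\mathbb{E}_Q \alpha^2]^{1/2} \cdot \|\gamma\| \sqrt{2 D(Q\|P)}, \tag{5.33} \]
\[ \mathbb{E}_P f - \mathbb{E}_Q f \le [\mathbb{E}_P \alpha^2]^{1/2} \cdot \|\gamma\| \sqrt{2 D(Q\|P)}. \tag{5.34} \]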

We will now use the following simple lemma, which follows by the Cauchy-Schwarz inequality:

Lemma 5.6. For $A, B \ge 0$, $\lambda > 0$, we have

\[ \sqrt{AB} \le \frac{\lambda A}{2} + \frac{B}{2\lambda}. \]
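
As a quick sanity check of Lemma 5.6, the sketch below evaluates the gap between the two sides on random inputs. The helper name `lemma_5_6_gap`, the sampling ranges and the seed are arbitrary assumptions made for illustration.

```python
import math
import random

def lemma_5_6_gap(a, b, lam):
    """Return lam*a/2 + b/(2*lam) - sqrt(a*b), which Lemma 5.6 claims is nonnegative."""
    return lam * a / 2.0 + b / (2.0 * lam) - math.sqrt(a * b)

rng = random.Random(0)
gaps = [
    lemma_5_6_gap(rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0), rng.uniform(0.01, 10.0))
    for _ in range(10000)
]
min_gap = min(gaps)

# The bound is attained when lam = sqrt(b / a); here a = 4, b = 9, lam = 1.5.
tight = lemma_5_6_gap(4.0, 9.0, 1.5)
print(min_gap, tight)
```

The gap equals $(\sqrt{\lambda A} - \sqrt{B/\lambda})^2 / 2$, which explains both the nonnegativity and the equality case used in the last line.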



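Therefore, we can write, for any $\lambda > 0$,

\[ \mathbb{E}_Q f - \mathbb{E}_P f \le \frac{\lambda \|\gamma\|^2\, \mathbb{E}_Q \alpha^2}{2} + \frac{1}{\lambda} D(Q\|P), \]
\[ \mathbb{E}_P f - \mathbb{E}_Q f \le \frac{\lambda \|\gamma\|^2\, \mathbb{E}_P \alpha^2}{2} + \frac{1}{\lambda} D(Q\|P). \]

These can be rewritten as

\[ \mathbb{E}_Q\left[\lambda (f - \mathbb{E}_P f) - \frac{\lambda^2 \|\gamma\|^2 \alpha^2}{2}\right] \le D(Q\|P), \tag{5.35} \]
\[ \mathbb{E}_Q\left[-\lambda (f - \mathbb{E}_P f) - \frac{\lambda^2 \|\gamma\|^2\, \mathbb{E}_P \alpha^2}{2}\right] \le D(Q\|P). \tag{5.36} \]
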


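Now we will need the following lemma:

Lemma 5.7. Suppose that a distribution P on $\Lambda$ satisfies, for some function $g : \Lambda \to \mathbb{R}$, that for every distribution Q on $\Lambda$,

\[ \mathbb{E}_Q g \le D(Q\|P), \]

then $\mathbb{E}_P(e^g) \le 1$.

Proof. Choose $\frac{Q(x)}{P(x)} := \frac{e^{g(x)}}{\mathbb{E}_P(e^{g})}$, then use the definition of $D(Q\|P)$. □

Applying this to (5.35) and (5.36), we get

\[ \mathbb{E}_P \exp\left[\lambda (f - \mathbb{E}_P f) - \frac{\lambda^2 \|\gamma\|^2 \alpha^2}{2}\right] \le 1, \tag{5.37} \]
\[ \mathbb{E}_P \exp\left[-\lambda (f - \mathbb{E}_P f) - \frac{\lambda^2 \|\gamma\|^2\, \mathbb{E}_P \alpha^2}{2}\right] \le 1. \tag{5.38} \]

Now, using the weakly a-self-bounding condition, we can write $\alpha^2(x) \le a f(x) + b$, thus

\[ \mathbb{E}_P \exp\left[\lambda \left(f - \mathbb{E}_P f - \lambda \|\gamma\|^2 a f / 2\right) - \frac{\lambda^2 \|\gamma\|^2 b}{2}\right] \le 1, \tag{5.39} \]
\[ \mathbb{E}_P \exp\left[-\lambda (f - \mathbb{E}_P f)\right] \le \exp\left[\frac{\lambda^2 \|\gamma\|^2 (a\, \mathbb{E}_P f + b)}{2}\right]. \tag{5.40} \]

The tail bounds follow by Markov's inequality (for the first case, with the choice $\lambda = \frac{t}{\|\gamma\|^2 (a t + a\, \mathbb{E}_P f + b)}$).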
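
Lemma 5.7 rests on the fact that the exponentially tilted distribution attains $\sup_Q \left(\mathbb{E}_Q g - D(Q\|P)\right) = \log \mathbb{E}_P(e^g)$. The sketch below illustrates this on a finite space; the three-point distribution `p`, the function `g`, and the helper names are hypothetical and chosen only for illustration.

```python
import math
import random

def kl(q, p):
    """D(Q||P) for finite distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def tilted(p, g):
    """Return Q(x) = e^{g(x)} P(x) / E_P(e^g) together with E_P(e^g)."""
    weights = [pi * math.exp(gi) for pi, gi in zip(p, g)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(q, p, g):
    """E_Q g - D(Q||P)."""
    return sum(qi * gi for qi, gi in zip(q, g)) - kl(q, p)

# Hypothetical three-point example.
p = [0.2, 0.5, 0.3]
g = [0.4, -1.0, 0.7]
q_star, z = tilted(p, g)
best = objective(q_star, p, g)  # should equal log E_P(e^g)

# Random competitors should never beat the tilted distribution.
rng = random.Random(1)
others = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in p]
    total = sum(raw)
    others.append(objective([r / total for r in raw], p, g))

print(best, math.log(z), max(others))
```

If $\mathbb{E}_Q g \le D(Q\|P)$ holds for every Q, then in particular it holds for the tilted distribution, which forces $\log \mathbb{E}_P(e^g) \le 0$.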

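For later use, for $a > 0$, we will further bound (5.39): we can write

\[ \mathbb{E}_P \exp\left[\left(\lambda - \lambda^2 \|\gamma\|^2 a / 2\right) f\right] \le \exp\left[\lambda\, \mathbb{E}_P f + \frac{\lambda^2 \|\gamma\|^2 b}{2}\right]. \tag{5.41} \]

Define the real valued function $d : \left[0, \frac{1}{2a\|\gamma\|^2}\right] \to \mathbb{R}$ as the smallest root of the equation $\lambda - \lambda^2 \|\gamma\|^2 a / 2 = z$, i.e.

\[ d(z) := \frac{1 - \sqrt{1 - 2a\|\gamma\|^2 z}}{a \|\gamma\|^2}, \tag{5.42} \]

then for $0 \le z \le \frac{1}{2a\|\gamma\|^2}$, we have

\[ \mathbb{E}_P \exp[z f] \le \exp\left[d(z)\, \mathbb{E}_P f + \frac{d(z)^2 \|\gamma\|^2 b}{2}\right]. \tag{5.43} \]

Finally, it is easy to see that for $0 \le z \le \frac{1}{2a\|\gamma\|^2}$,

\[ d(z) \le \frac{z}{1 - a\|\gamma\|^2 z}. \tag{5.44} \]

In the general case ($s(X) \ne 1$), the right hand side of (5.33) and (5.34) gets multiplied by $\sqrt{s(X)}$, thus changing from $\|\gamma\|^2$ to $\|\gamma\|^2 s(X)$ gives the final result.

□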
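
The formula (5.42) and the bound (5.44) can be checked on a grid. This is a sketch only: the parameter values and the variable name `gamma_sq` (standing for $\|\gamma\|^2$) are assumptions made for illustration.

```python
import math

def d(z, a, gamma_sq):
    """Smallest root in lam of lam - lam**2 * gamma_sq * a / 2 = z (cf. (5.42))."""
    disc = max(0.0, 1.0 - 2.0 * a * gamma_sq * z)  # guard against rounding at the endpoint
    return (1.0 - math.sqrt(disc)) / (a * gamma_sq)

# Hypothetical parameter values; gamma_sq plays the role of ||gamma||^2.
a, gamma_sq = 1.5, 2.0
z_max = 1.0 / (2.0 * a * gamma_sq)
grid = [z_max * k / 200 for k in range(201)]

# d(z) should solve the quadratic equation lam - lam^2 * gamma_sq * a / 2 = z ...
max_residual = max(
    abs(d(z, a, gamma_sq) - d(z, a, gamma_sq) ** 2 * gamma_sq * a / 2.0 - z) for z in grid
)
# ... and satisfy the bound d(z) <= z / (1 - a * gamma_sq * z) from (5.44).
bound_holds = all(
    d(z, a, gamma_sq) <= z / (1.0 - a * gamma_sq * z) + 1e-12 for z in grid
)
print(max_residual, bound_holds)
```

Writing $u = a\|\gamma\|^2 z$, the bound (5.44) is equivalent to $(1-u)^2 \ge 1 - 2u$, with equality in the limit at the right endpoint $u = 1/2$, which is why a small tolerance is used in the comparison.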



Proof of Corollary 2.6. Notice that if holds, then f is weakly a-(0, C) self-bounding, while if holds, then $-f$ is weakly a-(0, C) self-bounding. Applying Theorem 2.3 proves the result. □

Proof of Corollary 2.7. The conditions of Corollary 2.6 are satisfied, with C = 1. □

Proof of Corollary 2.8. Let us rewrite (2.26) as

\[ Z(x) = \sum_{i \le N} f_{i,J(x)}(x_i), \tag{5.45} \]

with

\[ J(x) := \operatorname{argmax}_{j \le M} \sum_{i \le N} f_{i,j}(x_i), \]
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then 



i<N 



Z(x) - Z(y) = fi,J(x)(xi) - faj(y)(Vi 

i<N i 
i<N i<N 



i<N 



Now $0 \le f(x) \le C$, thus the a-(1,0) self-boundedness of Z/C follows. □

Proof of Theorem 2.4. First, suppose that $\hat{X} = X$, then s(X) = 1, and N = n. Without loss of generality, suppose that C = 1 (changing to Z(x)/C solves the general case). Then

\[ S(y) - S(x) \le \sum_{i=1}^{N} \left[(f_i(y_i))_+ + (f_i(x_i))_-\right] \cdot \mathbf{1}[x_i \ne y_i], \tag{5.46} \]




thus, similarly to the proof of Theorem 2.3, we can write, for any measure Q on A, 



E Q S - E P S < 



\ 



Er 



i<N 



d 2 (Q,P) + 



\ 



Er 



i<N 



d 2 (P,Q) < 

(5.47) 



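where $V_+(x) := \sum_{i \le N} (f_i(x_i))_+^2$ and $V_-(x) := \sum_{i \le N} (f_i(x_i))_-^2$.
Now by Lemma 5.6, we get

\[ \mathbb{E}_Q S - \mathbb{E}_P S \le \frac{\lambda \|\gamma\|^2}{2} \left(\mathbb{E}_Q(V_+) + \mathbb{E}_P(V_-)\right) + \frac{2}{\lambda} D(Q\|P), \]
\[ \mathbb{E}_Q\left[\frac{\lambda}{2}(S - \mathbb{E}_P S) - \frac{\lambda^2 \|\gamma\|^2}{4} \left(V_+ + \mathbb{E}_P(V_-)\right)\right] \le D(Q\|P). \]

By Lemma 5.7, we get

\[ \mathbb{E}_P \exp\left[\frac{\lambda}{2}(S - \mathbb{E}_P S) - \frac{\lambda^2 \|\gamma\|^2}{4} \left(V_+ + \mathbb{E}_P(V_-)\right)\right] \le 1. \]

Now we will use a simple lemma:

Lemma 5.8. Let $A, B > 0$ be random variables defined on the same probability space. If $\mathbb{E}(A/B) \le 1$, then $\mathbb{E}(A^{1/2}) \le \mathbb{E}(B)^{1/2}$.

Proof. $A^{1/2} = (A/B)^{1/2} \cdot B^{1/2}$, so applying Cauchy-Schwarz gives the result. □

By this lemma, we get

\[ \mathbb{E}_P \exp\left[\frac{\lambda}{4}(S - \mathbb{E}_P S)\right] \le \left[\mathbb{E}_P \exp\left(\frac{\lambda^2 \|\gamma\|^2}{4} V_+\right)\right]^{1/2} \cdot \exp\left[\frac{\lambda^2 \|\gamma\|^2}{8}\, \mathbb{E}_P V_-\right], \tag{5.48} \]

so

\[ \mathbb{E}_P \exp[\lambda (S - \mathbb{E}_P S)] \le \left[\mathbb{E}_P \exp\left(4\lambda^2 \|\gamma\|^2 V_+\right)\right]^{1/2} \cdot \exp\left[2\lambda^2 \|\gamma\|^2\, \mathbb{E}_P V_-\right]. \tag{5.49} \]

Now $V_+$ is (1,0)-self-bounding, so, by (5.43) and (5.44), we have for $0 \le z \le \frac{1}{2\|\gamma\|^2}$,

\[ \mathbb{E}_P \exp[z V_+] \le \exp\left[\frac{z}{1 - \|\gamma\|^2 z}\, \mathbb{E}_P V_+\right]. \]

With the choice $z = 4\lambda^2 \|\gamma\|^2$, we have, for $0 \le \lambda \le \frac{1}{2\sqrt{2}\|\gamma\|^2}$,

\[ \mathbb{E}_P \exp\left(4\lambda^2 \|\gamma\|^2 V_+\right) \le \exp\left[\frac{4\lambda^2 \|\gamma\|^2}{1 - 4\|\gamma\|^4 \lambda^2}\, \mathbb{E}_P V_+\right] \le \exp\left[\frac{4\lambda^2 \|\gamma\|^2}{1 - 2\|\gamma\|^2 \lambda}\, \mathbb{E}_P V_+\right]. \]

Combining this with $\mathbb{E}(V_-) + \mathbb{E}(V_+) \le V$, and (5.49), we get, for $0 \le \lambda < \frac{1}{2\sqrt{2}\|\gamma\|^2}$,

\[ \mathbb{E}_P \exp[\lambda (S - \mathbb{E}_P S)] \le \exp\left[\frac{2\|\gamma\|^2 V \lambda^2}{1 - 2\sqrt{2}\|\gamma\|^2 \lambda}\right]. \]
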
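We get the tail bounds by the following simple lemma:
Lemma 5.9. Let $G(\lambda) := \log\left(\mathbb{E} e^{\lambda (f(X) - \mathbb{E} f(X))}\right)$. If for every $\lambda > 0$,

\[ G(\lambda) \le \frac{c_1 \lambda^2}{1 - c_2 \lambda}, \tag{5.50} \]

for some $c_1, c_2 > 0$, then for every $t > 0$,

\[ \mathbb{P}(f(X) - \mathbb{E} f(X) \ge t) \le \exp\left(\frac{-t^2}{4c_1 + 2c_2 t}\right). \tag{5.51} \]

Proof. Apply Markov's inequality for $\lambda = \frac{t}{2c_1 + c_2 t}$.

□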
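
The choice of $\lambda$ in the proof of Lemma 5.9 can be verified by direct computation: substituting it into the Chernoff exponent $-\lambda t + c_1\lambda^2/(1 - c_2\lambda)$ reproduces the exponent in (5.51). The sketch below does this numerically; the constants `c1`, `c2` and the grid of `t` values are arbitrary illustrative assumptions.

```python
def optimized_exponent(t, c1, c2):
    """Exponent -lam*t + c1*lam^2/(1 - c2*lam) at lam = t / (2*c1 + c2*t)."""
    lam = t / (2.0 * c1 + c2 * t)
    return -lam * t + c1 * lam ** 2 / (1.0 - c2 * lam)

def claimed_exponent(t, c1, c2):
    """Exponent of the tail bound (5.51)."""
    return -t ** 2 / (4.0 * c1 + 2.0 * c2 * t)

# Hypothetical constants and a grid of deviations t.
c1, c2 = 2.0, 0.5
ts = [0.1 * k for k in range(1, 501)]
max_diff = max(abs(optimized_exponent(t, c1, c2) - claimed_exponent(t, c1, c2)) for t in ts)
print(max_diff)  # should be at the level of rounding error
```

Note that this $\lambda$ always satisfies $c_2 \lambda < 1$, so the right hand side of (5.50) is finite at the chosen point.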



The general case ($s(X) \ne 1$) follows by applying the general version of Theorem 2.2 in (5.47), and changing $\|\gamma\|^2$ to $\|\gamma\|^2 s(X)$. □

Proof of Corollary 2.9. This is similar to the proof of Corollary 2.4, using (2.30). □

Proof of Theorem 2.5. The proof is similar to the proof of Theorem 2.4. A different proof, 
with slightly worse constants, is possible using Theorem 2.6. 

Again, first suppose that s(X) = 1. 

Let us reformulate (2.36) as

\[ Z(x) = \sum_{i \le N} f_{i,J(x)}(x_i), \tag{5.52} \]

with

\[ J(x) := \operatorname{argmax}_{j \le M} \sum_{i \le N} f_{i,j}(x_i). \]
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Now 



Z(x) - Z(y) = Y fi,J(y)(Vi) ~ Y 

i<N i<N 

^ Y hj{y)(Vi) ~ Y hj{y)( x i) 

i<N i<N 

ifo ^ Vi], 

Define $W_+(x) := \sum_{i \le N} \max_{j \le M} (f_{i,j}(x_i))_+^2$, and $W_-(x) := \sum_{i \le N} \max_{j \le M} (f_{i,j}(x_i))_-^2$, then



i<N 
i<7V 



(fi,j( y )(yi))+ + maxtfijfa)). 

3<M 



E Q Z - EpZ < 



\ 



Q 



,i<JV 



d 2 (Q,P) 



+ 



z — * 



KiV 



N 

From here, similar arguments as in the proof of Theorem 2.4, and the fact that $\mathbb{E}(W_+) + \mathbb{E}(W_-) \le W$,
lead to

Ep exp \X(Z - EpZ)} < exp 

and thus the tail bounds follow by Lemma 5.9. The proof for the lower tail is similar.

The general case ($s(X) \ne 1$) follows by replacing $\|\gamma\|^2$ with $\|\gamma\|^2 s(X)$. The proof for (2.41) is similar, except that $W_+(x)$ and $W_-(x)$ are replaced by $\sum_{i \le N} \max_{j \le M} (f_{i,j}(x_i))^2$, but this does not change the result (since we have already bounded their expected value by W). □

Proof of Theorem 2.6. We can obviously define / : A — y R such that f(x) = f(x) for every 
x G A. Suppose first that / satisfies Condition 1, then / satisfies, for every x, y G A, 



(5.53) 



i=l 



with &i{x) := J2 j ex l (x) a j( x )- 

Similarly, if / satisfies Condition 1, then / satisfies, for every x, y G A, 



i=l 

with «i(x) := Ej6Z,(X) and A(^) := EjeziC*) 



(5.54) 
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Then, with these notations, it is clear that 



V & := E p J2®i(X) <s(X)V a , 

i=l 
n 

V $ := Ep^A 2 (X)< S (X)^, 



^ a (r) := \ogE p e T ^°ZW < g a (s(X)r), 
g (r) := logEpe^-^<^( S (X)r). 

Therefore, in the following, we can make this assumption without loss of generality:
Assumption 5.1. $\hat{X} = X$, and thus N = n and s(X) = 1.
Define, for $\lambda > 0$, $x \in \Lambda$,

Mx) ^ e X p(A(/M-E/)) pM> (5 55) 

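Now we divide our argument into two parts, depending on which condition on f holds.
Proof for Condition 1. We will use the following lemma:
Lemma 5.10. Let $Y \sim \mu_\lambda$, then

\[ D(\mu_\lambda\|P) \le 2\lambda^2 \|\gamma\|^2 \cdot \mathbb{E}_{\mu_\lambda} \sum_{i=1}^{n} \alpha_i^2(Y). \tag{5.57} \]

Proof. First step:

\begin{align*}
D(\mu_\lambda\|P) &= \sum_{x \in \Lambda} \log\left(\frac{\mu_\lambda(x)}{P(x)}\right) \mu_\lambda(x) \\
&= \sum_{x \in \Lambda} \left(\lambda [f(x) - \mathbb{E} f] - \log F(\lambda) + \log P(x) - \log P(x)\right) \mu_\lambda(x) \\
&= \lambda [\mathbb{E}_{\mu_\lambda} f(Y) - \mathbb{E}_P f(X)] - \log F(\lambda) \tag{5.58} \\
&\le \lambda [\mathbb{E}_{\mu_\lambda}(f(Y)) - \mathbb{E}_P(f(X))].
\end{align*}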

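By (5.54), we can further bound this as

\[ \lambda [\mathbb{E}_{\mu_\lambda}(f(Y)) - \mathbb{E}_P(f(X))] = \lambda \left[\mathbb{E}_{\pi(X \sim P, Y \sim \mu_\lambda)} (f(Y) - f(X))\right] \le \lambda\, \mathbb{E}_\pi\left(\sum_{i=1}^{n} \alpha_i(Y)\, \mathbf{1}[X_i \ne Y_i]\right). \]

By (2.13), we have

\begin{align*}
d_2(\mu_\lambda, P) &= \inf_{\pi(X \sim P, Y \sim \mu_\lambda)} \; \sup_{\alpha:\, \mathbb{E}_{\mu_\lambda} \sum_i \alpha_i^2(Y) \le 1} \mathbb{E}_\pi\left(\sum_{i=1}^{n} \alpha_i(Y)\, \mathbf{1}[X_i \ne Y_i]\right) \\
&= \inf_{\pi(X \sim P, Y \sim \mu_\lambda)} \; \sup_{\alpha} \frac{1}{\left(\mathbb{E}_{\mu_\lambda} \sum_i \alpha_i^2(Y)\right)^{1/2}}\, \mathbb{E}_\pi\left(\sum_{i=1}^{n} \alpha_i(Y)\, \mathbf{1}[X_i \ne Y_i]\right),
\end{align*}

thus

\begin{align*}
D(\mu_\lambda\|P) &\le \lambda \left(\mathbb{E}_{\mu_\lambda} \sum_{i=1}^{n} \alpha_i^2(Y)\right)^{1/2} d_2(\mu_\lambda, P) \\
&\le \lambda \left(\mathbb{E}_{\mu_\lambda} \sum_{i=1}^{n} \alpha_i^2(Y)\right)^{1/2} \|\gamma\| \sqrt{2 D(\mu_\lambda\|P)},
\end{align*}

the statement follows by rearrangement.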

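We need to further bound $\mathbb{E}_{\mu_\lambda} \sum_{i=1}^{n} \alpha_i^2(Y)$:
Lemma 5.11. For any $r > 0$,

\[ \mathbb{E}_{\mu_\lambda}\left(\sum_{i=1}^{n} \alpha_i^2(Y)\right) \le \frac{1}{r} \left(D(\mu_\lambda\|P) + g_\alpha(r)\right). \tag{5.60} \]

Proof. Let us define Q as

\[ Q(x) := \frac{\exp\left(r \sum_{i=1}^{n} \alpha_i^2(x)\right)}{\mathbb{E}_P\left(\exp\left(r \sum_{i=1}^{n} \alpha_i^2(X)\right)\right)} \cdot P(x), \]

then

\begin{align*}
0 \le D(\mu_\lambda\|Q) &= \sum_{x \in \Lambda} \mu_\lambda(x) \log\left(\frac{\mu_\lambda(x)}{Q(x)}\right) \\
&= \sum_{x \in \Lambda} \mu_\lambda(x) \left[\log(\mu_\lambda(x)) - \log(P(x)) - r \sum_{i=1}^{n} \alpha_i^2(x) + \log\left(\mathbb{E}_P e^{r \sum_i \alpha_i^2(X)}\right)\right] \\
&= D(\mu_\lambda\|P) - r\, \mathbb{E}_{\mu_\lambda}\left(\sum_{i=1}^{n} \alpha_i^2(Y)\right) + \log\left(\mathbb{E}_P\left(e^{r \sum_i \alpha_i^2(X)}\right)\right),
\end{align*}

so we have

\[ r\, \mathbb{E}_{\mu_\lambda}\left(\sum_{i=1}^{n} \alpha_i^2(Y)\right) \le D(\mu_\lambda\|P) + \log\left(\mathbb{E}_P\left(e^{r \sum_i \alpha_i^2(X)}\right)\right), \]

and thus (5.60) follows.

Combining the two lemmas, we get

\[ D(\mu_\lambda\|P) \le 2\lambda^2 \|\gamma\|^2 \cdot \frac{1}{r} \left(D(\mu_\lambda\|P) + g_\alpha(r)\right), \]
\[ D(\mu_\lambda\|P) \le \frac{2\lambda^2 \|\gamma\|^2\, g_\alpha(r)}{r - 2\lambda^2 \|\gamma\|^2}, \]

for every $r > 2\lambda^2 \|\gamma\|^2$.

We continue with the Herbst argument (we use (5.58)):

\begin{align*}
\frac{d}{d\lambda} \left(\frac{G(\lambda)}{\lambda}\right) &= \frac{\lambda F'(\lambda)/F(\lambda) - G(\lambda)}{\lambda^2} \\
&= \frac{1}{\lambda} [\mathbb{E}_{\mu_\lambda} f(Y) - \mathbb{E}_P f(X)] - \frac{1}{\lambda^2} \log F(\lambda) \\
&= \frac{1}{\lambda^2} D(\mu_\lambda\|P) \le \frac{2\|\gamma\|^2\, g_\alpha(r)}{r - 2\lambda^2 \|\gamma\|^2}.
\end{align*}

The right hand side is increasing in $\lambda$, and $\lim_{\lambda \to 0} \frac{G(\lambda)}{\lambda} = 0$, therefore

\[ G(\lambda) \le \frac{2\lambda^2 \|\gamma\|^2\, g_\alpha(r)}{r - 2\lambda^2 \|\gamma\|^2}, \]

thus (2.42) follows.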
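
The identity $\frac{d}{d\lambda}\left(G(\lambda)/\lambda\right) = D(\mu_\lambda\|P)/\lambda^2$ at the heart of the Herbst argument can be checked numerically on a finite space. The sketch below assumes $F(\lambda) = \mathbb{E}_P e^{\lambda(f - \mathbb{E} f)}$, $G = \log F$ and $\mu_\lambda(x) = e^{\lambda(f(x) - \mathbb{E} f)} P(x)/F(\lambda)$, as the displays above suggest; the four-point distribution and the function values are hypothetical.

```python
import math

# Hypothetical finite example: distribution P and function values f.
P = [0.1, 0.2, 0.3, 0.4]
f = [1.0, -0.5, 2.0, 0.3]
mean_f = sum(p * v for p, v in zip(P, f))

def F(lam):
    """Moment generating function of the centered f under P."""
    return sum(p * math.exp(lam * (v - mean_f)) for p, v in zip(P, f))

def G(lam):
    return math.log(F(lam))

def mu(lam):
    """Exponentially tilted distribution mu_lambda."""
    z = F(lam)
    return [p * math.exp(lam * (v - mean_f)) / z for p, v in zip(P, f)]

def kl(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def derivative_of_G_over_lam(lam, h=1e-5):
    """Central finite difference approximation of d/dlam (G(lam)/lam)."""
    return (G(lam + h) / (lam + h) - G(lam - h) / (lam - h)) / (2.0 * h)

gaps = [
    abs(derivative_of_G_over_lam(lam) - kl(mu(lam), P) / lam ** 2)
    for lam in (0.5, 1.0, 2.0)
]
print(gaps)  # each entry should be small (finite-difference error only)
```

Any bound on $D(\mu_\lambda\|P)/\lambda^2$ therefore integrates to a bound on $G(\lambda)/\lambda$, which is how (2.42) is obtained above.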

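For the lower tail, we will need

Lemma 5.12.

\[ D(\nu_\lambda\|P) \le 2\lambda^2 \|\gamma\|^2 \cdot \mathbb{E}_P \sum_{i=1}^{n} \alpha_i^2(X). \]

Proof.

\begin{align*}
D(\nu_\lambda\|P) &= \sum_{x \in \Lambda} \log\left(\frac{\nu_\lambda(x)}{P(x)}\right) \nu_\lambda(x) \\
&= \sum_{x \in \Lambda} \left(-\lambda [f(x) - \mathbb{E} f] - \log F(-\lambda) + \log P(x) - \log P(x)\right) \nu_\lambda(x) \\
&= -\lambda [\mathbb{E}_{\nu_\lambda} f(Y) - \mathbb{E}_P f(X)] - \log F(-\lambda) \\
&\le -\lambda [\mathbb{E}_{\nu_\lambda}(f(Y)) - \mathbb{E}_P(f(X))].
\end{align*}

By (5.54) and (2.13), we can further bound this as

\begin{align*}
\lambda [\mathbb{E}_P(f(X)) - \mathbb{E}_{\nu_\lambda}(f(Y))] &= \lambda \left[\mathbb{E}_{\pi(X \sim P, Y \sim \nu_\lambda)} (f(X) - f(Y))\right] \\
&\le \lambda\, \mathbb{E}_\pi\left(\sum_{i=1}^{n} \alpha_i(X)\, \mathbf{1}[X_i \ne Y_i]\right) \\
&\le \lambda \left(\mathbb{E}_P \sum_{i=1}^{n} \alpha_i^2(X)\right)^{1/2} d_2(P, \nu_\lambda) \\
&\le \lambda \left(\mathbb{E}_P \sum_{i=1}^{n} \alpha_i^2(X)\right)^{1/2} \|\gamma\| \sqrt{2 D(\nu_\lambda\|P)}.
\end{align*}

The Herbst argument in this case (see (5.64)):

\begin{align*}
\frac{d}{d\lambda} \left(\frac{G(-\lambda)}{\lambda}\right) &= \frac{-\lambda F'(-\lambda)/F(-\lambda) - G(-\lambda)}{\lambda^2} \\
&= -\frac{1}{\lambda} [\mathbb{E}_{\nu_\lambda} f(Y) - \mathbb{E}_P f(X)] - \frac{1}{\lambda^2} \log F(-\lambda) \\
&= \frac{1}{\lambda^2} D(\nu_\lambda\|P) \le 2\|\gamma\|^2 V_\alpha,
\end{align*}

and thus (2.43) follows by integration.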

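Proof for Condition 2. The proof is based on the following lemma:
Lemma 5.13. For every $\lambda > 0$, $r > 4\lambda^2 \|\gamma\|^2$,

\[ D(\mu_\lambda\|P) \le \frac{4\lambda^2 \|\gamma\|^2}{r - 4\lambda^2 \|\gamma\|^2} \left(g_\alpha(r) + r V_\beta\right). \tag{5.65} \]

Proof.

\begin{align*}
D(\mu_\lambda\|P) &\le \lambda [\mathbb{E}_{\mu_\lambda} f(Y) - \mathbb{E}_P f(X)] \\
&\le \lambda \|\gamma\| \left(\left(\mathbb{E}_{\mu_\lambda} \sum_{i=1}^{n} \alpha_i^2(Y)\right)^{1/2} + \left(\mathbb{E}_P \sum_{i=1}^{n} \beta_i^2(X)\right)^{1/2}\right) \cdot \sqrt{2 D(\mu_\lambda\|P)}.
\end{align*}

By Lemma 5.11, we get

\[ \mathbb{E}_{\mu_\lambda}\left(\sum_{i=1}^{n} \alpha_i^2(Y)\right) \le \frac{1}{r} \left(D(\mu_\lambda\|P) + g_\alpha(r)\right), \]

and thus we can write

\[ D(\mu_\lambda\|P) \le \lambda \|\gamma\| \left(\left(\frac{1}{r} \left(D(\mu_\lambda\|P) + g_\alpha(r)\right)\right)^{1/2} + V_\beta^{1/2}\right) \sqrt{2 D(\mu_\lambda\|P)}, \]
\[ D(\mu_\lambda\|P) \le 4\lambda^2 \|\gamma\|^2 \left(\frac{1}{r} \left(D(\mu_\lambda\|P) + g_\alpha(r)\right) + V_\beta\right), \]
\[ D(\mu_\lambda\|P) \left(r - 4\lambda^2 \|\gamma\|^2\right) \le 4\lambda^2 \|\gamma\|^2 \left(g_\alpha(r) + r V_\beta\right), \]

this implies (5.65). □

Now (2.44) follows by the Herbst argument. The proof of (2.45) is similar. □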



Proof of Corollary 2.10. Let $X_1^*, \ldots, X_n^*$ be a stationary Markov chain with the same probability transition matrix, then it is easy to see that



d TV (£(X t0+1 ),£(X; a+1 )) < mi^e 



I f 
L'mixM 
therefore, by the Markov property, we can make a coupling between $(X_{t_0+1}, \ldots, X_n)$ and $(X^*_{t_0+1}, \ldots, X^*_n)$ such that

p ((x t0+1 , ...,x n )? (x: o+1 , ...,K))< ■ 

Now applying Theorem 1.1 of Lezaud (1998a) to $X^*_{t_0+1}, \ldots, X^*_n$, and using Proposition 1.1 proves the first result.

For the second result, we need to modify the proof of Theorem 1.2 on page 47 of Lezaud 
(1998b). 

Lemma 1.1. of Lezaud (1998b) shows that for any r > 0, 

F[Z- EJ >t}< e r N q exp {-n (rt - log (A,(P(r))))} , (5.66) 
here flo(P(r)) having the following Taylor expansion, for < r < |: 

oo 

/3 (P(r)) = l + ^/3("V", 

n=l 

with /3« = 0, (3^ = a 2 /2, and ^ < (V f /5)(5/j) n - 1 for n > 3. 

From this point (using Proposition 1.5 on page 48, which shows that a 2 < 2Vf/j), Lezaud 

(1998b) shows that /5o (P(r)) < 1 + -^-r 2 ^1 — ^\ , and thus obtains a result depending on 
Vf and 7 only. Here, we take a different approach: 



9 

MP(r)) < 1 + °-r' z + 



n=3 V ' 7 / 



Denote K := and K' := K — -, then AT > |, AT' > (by Proposition 1.5), and 

0.2 2 / \ 2 r 2 (1 + ATV) 

W))<i + — + ^ + \_ 5 _ r ] 

7 / 7 



, a 2 r 2 (1 + K'r) 
< exp I — • 



2 1 - Sr 

7 



Finally, we apply (5.66), with the choice of r as the positive solution of 

a 2 r 2 (1 + K'r) rt 



2 1 - s-r 2 

7 



Solving this quadratic equation gives 



a 2 + + 4a 2 i^ - (a 2 + it) 
2a 2 A' / ' 

noticing that with this choice of r, rt — log (Po(P(r))) > ~, proves the result. □ 
Proof of Theorem 2.8. The proof is based on the trick of Janson (2004), as in the proof of Corollary 2.4.

Denote $t_n = \sum_{i=1}^{n} f(X_i)$, then, by page 857 of Lezaud (1998a) and page 52 of Lezaud (1998b), in the case of initial distribution q, we have

\[ \mathbb{E}_q \exp(r t_n) = q^T P(r)^n \mathbf{1} \le N_q \left\|(P(r)^*)^n P(r)^n\right\|_2^{1/2} \le N_q\, \beta_0(r)^{n/2}, \tag{5.67} \]

where $\beta_0(r)$ denotes the largest eigenvalue of the operator $K(r) := P(r)^* P(r)$.
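It is then shown that

\[ \beta_0(r) \le \exp\left(\frac{4 V_f r^2}{\gamma(K)} \left(1 - \frac{10 r}{\gamma(K)}\right)^{-1}\right), \tag{5.68} \]

and the tail bound follows by Markov's inequality.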

First, suppose that $X_1, \ldots, X_N$ is stationary, then $q = \pi$, and $N_q = 1$. Let $S = \sum_{i=1}^{N} f(X_i)$.
Let us fix some integer $k \ge 1$, and divide $f(X_1), \ldots, f(X_N)$ into k parts:

\[ (f(X_1), f(X_{k+1}), \ldots), \ldots, (f(X_k), f(X_{2k}), \ldots). \]

Denote the sums of each part by $S_1, \ldots, S_k$, then $S = \sum_{i=1}^{k} S_i$.

Suppose, without loss of generality, that N is divisible by k, then for each $i \le k$, applying (5.67) and (5.68) gives

\[ \mathbb{E} \exp(r S_i) \le \exp\left(\frac{N}{k} \cdot \frac{2 V_f r^2}{\gamma((P^*)^k P^k)} \left(1 - \frac{10 r}{\gamma((P^*)^k P^k)}\right)^{-1}\right). \tag{5.69} \]

Jensen's inequality shows that $\mathbb{E} \exp(r S) \le \frac{1}{k} \sum_{i=1}^{k} \mathbb{E} \exp(r k S_i)$, thus

\[ \mathbb{E} \exp(r S) \le \exp\left(N \cdot \frac{2 V_f r^2}{\gamma((P^*)^k P^k)/k} \left(1 - \frac{10 r}{\gamma((P^*)^k P^k)/k}\right)^{-1}\right). \tag{5.70} \]

This means that the statement of Theorem 3.3 of Lezaud (1998a) holds with $\gamma(K)$ replaced by $\gamma((P^*)^k P^k)/k$. Optimizing in k gives the result. The non-stationary case can be handled the same way as in Corollary 2.10. □
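
The Jensen step used above holds pointwise, since $e^{rS} = \exp\left(\frac{1}{k}\sum_{i} r k S_i\right) \le \frac{1}{k}\sum_i e^{r k S_i}$ by convexity of the exponential. The following sketch checks this on random sequences; the helper names, the uniform sampling and the parameter values are hypothetical and serve only as an illustration.

```python
import math
import random

def interleaved_sums(values, k):
    """Sums S_1, ..., S_k of the parts (v_1, v_{k+1}, ...), ..., (v_k, v_{2k}, ...)."""
    return [sum(values[i::k]) for i in range(k)]

def jensen_gap(values, k, r):
    """(1/k) * sum_i exp(r*k*S_i) - exp(r*S), nonnegative by convexity of exp."""
    parts = interleaved_sums(values, k)
    return sum(math.exp(r * k * s) for s in parts) / k - math.exp(r * sum(parts))

# Hypothetical experiment: random sequences of length n split into k interleaved parts.
rng = random.Random(0)
k, r, n = 3, 0.3, 12
gaps = [
    jensen_gap([rng.uniform(-1.0, 1.0) for _ in range(n)], k, r)
    for _ in range(2000)
]
print(min(gaps))  # should not be negative beyond rounding error
```

Taking expectations of the pointwise inequality gives the bound on $\mathbb{E}\exp(rS)$, after which each term can be handled by (5.69) with r replaced by rk.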